
alex-lekov / automl_alex

State-of-the-art Automated Machine Learning Python library for Tabular Data

License: MIT License

Python 99.69% Dockerfile 0.31%
automl auto-ml optimisation ml cross-validation xgboost stacking stacking-ensemble hyperparameter-optimization hyperparameter-tuning sklearn python machine-learning machine-learning-library machine-learning-models data-science data-science-projects model-selection automatic-machine-learning

automl_alex's Introduction

AutoML Alex



State-of-the-art Automated Machine Learning Python library for Tabular Data

Works with Tasks:

  • Binary Classification

  • Regression

  • Multiclass Classification (in progress...)

Benchmark Results

(benchmark results chart)

Higher is better. Results from AutoML-Benchmark.

Scheme

(pipeline scheme diagram)

Features

  • Automated Data Clean (Auto Clean)
  • Automated Feature Engineering (Auto FE)
  • Smart Hyperparameter Optimization (HPO)
  • Feature Generation
  • Feature Selection
  • Model Selection
  • Cross Validation
  • Optimization Time Limit and Early Stopping
  • Save and Load (Predict new data)
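Several of these steps can be illustrated generically. For instance, cross validation splits the data into k folds so every sample is used for validation exactly once. A minimal index-based sketch of the idea (not automl_alex's internal implementation):

```python
def kfold_indices(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) pairs; each sample appears in exactly one
    validation fold. A generic sketch of the cross-validation step."""
    # Distribute samples as evenly as possible across folds
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, val_idx in enumerate(folds):
        # Train on everything that is not in the current validation fold
        train_idx = [j for k, f in enumerate(folds) if k != i for j in f]
        yield train_idx, val_idx

splits = list(kfold_indices(10, n_folds=5))  # 5 pairs of (8 train, 2 val)
```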

Installation

pip install automl-alex

Docs

DocPage

🚀 Examples

Classifier:

from automl_alex import AutoMLClassifier

model = AutoMLClassifier()
model.fit(X_train, y_train, timeout=600)
predicts = model.predict(X_test)

Regression:

from automl_alex import AutoMLRegressor

model = AutoMLRegressor()
model.fit(X_train, y_train, timeout=600)
predicts = model.predict(X_test)

DataPrepare:

from automl_alex import DataPrepare

de = DataPrepare()
X_train = de.fit_transform(X_train)
X_test = de.transform(X_test)
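Note the fit-on-train, transform-on-test pattern: preprocessing statistics are estimated on the training split only and then reused on the test split, so no test-set information leaks into preprocessing. A minimal standard-scaling sketch of the same pattern (pure Python, not DataPrepare's actual code):

```python
class SimpleStandardizer:
    """Tiny illustration of the fit/transform split: statistics come
    from the training data only."""
    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        var = sum((v - self.mean_) ** 2 for v in values) / len(values)
        self.std_ = var ** 0.5 or 1.0  # guard against zero variance
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)

scaler = SimpleStandardizer()
train_scaled = scaler.fit_transform([1.0, 2.0, 3.0])  # stats from train
test_scaled = scaler.transform([4.0])                 # reuses train stats
```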

Simple Models Wrapper:

from automl_alex import LightGBMClassifier

model = LightGBMClassifier()
model.fit(X_train, y_train)
predicts = model.predict_proba(X_test)

model.opt(X_train, y_train,
    timeout=600,  # optimization time in seconds
    )
predicts = model.predict_proba(X_test)

More examples are in the ./examples folder.

What's inside

It integrates many popular frameworks:

  • scikit-learn
  • XGBoost
  • LightGBM
  • CatBoost
  • Optuna
  • ...

Works with Features

  • Categorical Features

  • Numerical Features

  • Binary Features

  • Text

  • Datetime

  • Timeseries

  • Image

Note

  • Large datasets can require a lot of memory: the library generates many new features. With a large dataset and a large number of features (more than 100), expect high memory usage.
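The memory growth is combinatorial: pairwise interaction features alone scale as n·(n−1)/2 per arithmetic operation. A rough count (an illustration only; the exact feature set automl_alex generates may differ):

```python
from itertools import combinations

def interaction_feature_count(n_features, n_operations=3):
    """Rough count of pairwise interaction features: one new feature per
    unordered pair of columns per operation (e.g. '*', '/', '-')."""
    pairs = len(list(combinations(range(n_features), 2)))  # n*(n-1)/2
    return pairs * n_operations

interaction_feature_count(10)    # 45 pairs * 3 ops = 135 new features
interaction_feature_count(100)   # 4950 pairs * 3 ops = 14850 new features
```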

Realtime Dashboard

Works with optuna-dashboard

(dashboard screenshot)

Run

$ optuna-dashboard sqlite:///db.sqlite3

Road Map

  • Feature Generation

  • Save/Load and Predict on New Samples

  • Advanced Logging

  • Add opt Pruners

  • Docs Site

  • DL Encoders

  • Add More libs (NNs)

  • Multiclass Classification

  • Build pipelines

Contact

Telegram Group

automl_alex's People

Contributors

alex-lekov, deepsource-autofix[bot], deepsourcebot, itlek


automl_alex's Issues

How can I solve the "Columns must be same length as key" error?

What I tried:

de = DataPrepare(
                num_generator_features=True, # Generator interaction Num Features
                # operations_num_generator=['/','*','-',],
                )
clean_X_train = de.fit_transform(train_X_all)
de = DataPrepare(clean_and_encod_data=True,
                # cat_encoder_names=['HelmertEncoder','OneHotEncoder'], # encoders used to generate cat-encoded features
                clean_nan=True, # fillnan
                clean_outliers=True, # method='IQR', threshold=2,
                drop_invariant=True, # drop invariant features (data.nunique < 2)
                num_generator_features=True, # Generator interaction Num Features
                num_denoising_autoencoder=True, # denoising_autoencoder if num features > 2
                normalization=True, # normalization data (StandardScaler)
                cat_features=None, # DataPrepare can auto detect categorical features
                random_state=42,
                verbose=3)

clean_X_train = de.fit_transform(train_X_all)

but I'm getting this error:

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in _setitem_array(self, key, value)
   3065             if isinstance(value, DataFrame):
   3066                 if len(value.columns) != len(key):
-> 3067                     raise ValueError("Columns must be same length as key")
   3068                 for k1, k2 in zip(key, value.columns):
   3069                     self[k1] = value[k2]

ValueError: Columns must be same length as key

Running Env : Google Colab

I can't understand why I'm getting "ValueError: Columns must be same length as key".

Is there anything to fix in my code or data?

I'm attaching my train data.

2021_04_22_orgianl_agg_heEncoded_looEncoded_df.zip

Thank you. Otherwise I'm using the library without problems.
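For context, the error itself is generic pandas behavior: it is raised when a DataFrame is assigned to a list of target columns whose lengths differ, which is likely what happens internally when feature generation produces an unexpected number of columns. A minimal standalone reproduction, independent of automl_alex:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
new_cols = pd.DataFrame({"x": [5, 6]})  # one column...

try:
    df[["x", "y"]] = new_cols           # ...assigned to two target keys
except ValueError as err:
    message = str(err)                  # "Columns must be same length as key"
```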

TypeError("unsupported operand type(s) for +: 'int' and 'str'")

Is it intended that this error happens?

  File "/home/../python3.7/site-packages/automl_alex/data_prepare.py", line 1050, in fit_transform
    data = self._clean_outliers_enc.transform(data)
  File "/home/../python3.7/site-packages/automl_alex/data_prepare.py", line 693, in transform
    feature_name = weight_values + "_Is_Outliers_" + self.method
TypeError: unsupported operand type(s) for +: 'int' and 'str'
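The traceback suggests a column with a non-string (integer) name reaching the outlier-naming step: concatenating it with a string literal fails. A minimal reproduction and the usual fix (casting with `str()`), under the assumption that `weight_values` holds a column name:

```python
weight_values = 3       # hypothetical integer column name, as the traceback implies
method = "IQR"

try:
    name = weight_values + "_Is_Outliers_" + method        # int + str -> TypeError
except TypeError:
    name = str(weight_values) + "_Is_Outliers_" + method   # casting fixes it

# name == "3_Is_Outliers_IQR"
```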

Want to use Databunch in Featurewiz

Hi Alex:
Thanks for your excellent AutoML repo! I was looking at your project and found that your DataBunch module can very quickly create hundreds of additional features on train and test. I am likely to re-use that module in the new "featurewiz" library I have created. I will cite your MIT license and credit your GitHub (of course). Just FYI.

Please take a look at the featurewiz library here:
https://github.com/AutoViML/featurewiz

Once again, thank you for sharing your knowledge with everyone.
Ram

Error during training

I'm trying to use automl_alex to make a baseline for this task, but it triggered an error.

My code is:

import pandas as pd
from automl_alex import AutoMLRegressor

X_train = df[df.columns.difference(['y'], sort=False)]
y_train = df.y
X_test = pd.read_csv('./data/test.csv', index_col='ID')

model = AutoMLRegressor(X_train, y_train, X_test, cat_features=X_train.columns, verbose=1)

%%time
predict_test, predict_train = model.fit_predict(verbose=2)

Step 1: Model 0


100%|██████████| 1/1 [00:14<00:00, 14.74s/it]

Model 1
One iteration takes ~ 4.5 sec

Start Auto calibration parameters
[I 2020-11-05 11:35:57,966] A new study created in memory with name: no-name-ecd42a72-44ac-442e-9b71-d621e2a383ea
Start optimization with the parameters:
CV_Folds = 5
Score_CV_Folds = 2
Feature_Selection = True
Opt_lvl = 2
Cold_start = 44.0
Early_stoping = 100
Metric = mean_squared_error
Direction = minimize
##################################################
Default model OptScore = 109.2303
Optimize: : 55it [20:47, 22.68s/it, | Model: ExtraTrees | OptScore: 105.3311 | Best mean_squared_error: 88.854 +- 16.477105]

Predict from Models_1
100%|██████████| 3/3 [01:27<00:00, 29.22s/it]
0%| | 0/1 [00:00<?, ?it/s]

Calc predict policy on Models_1:
| posible_repeats: 0 | stack_top: 1 | n_repeats: 1
100%|██████████| 1/1 [02:28<00:00, 148.44s/it]

Mean Score mean_squared_error on 5 Folds: 76.0092 std: 15.236319

Models_1 Mean mean_squared_error Score Train: 76.0097

Model 2

One iteration takes ~ 10.6 sec

Start Auto calibration parameters
Start optimization with the parameters:
CV_Folds = 5
Score_CV_Folds = 1
Feature_Selection = True
Opt_lvl = 1
Cold_start = 10
Early_stoping = 50
Metric = mean_squared_error
Direction = minimize
##################################################
Default model OptScore = 75.9577
Optimize: : 16it [02:44, 24.17s/it, | Model: MLP | OptScore: 109.1348 | Best mean_squared_error: 109.1348 ]

stack trace is:
/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/utils/extmath.py:153: RuntimeWarning: overflow encountered in matmul
ret = a @ b
/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/utils/extmath.py:153: RuntimeWarning: invalid value encountered in matmul
ret = a @ b
Trial 16 failed because of the following error: ValueError("Input contains NaN, infinity or a value too large for dtype('float64').")
Traceback (most recent call last):
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/optuna/study.py", line 799, in _run_trial
result = func(trial)
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/automl_alex/models/base.py", line 420, in objective
**data_kwargs,
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/automl_alex/models/base.py", line 762, in cross_val_score
res = self.cross_val(predict=False,**kwargs)
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/automl_alex/models/base.py", line 700, in cross_val
y_test=val_y.reset_index(drop=True),
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/automl_alex/models/sklearn_models.py", line 88, in _fit
model.model.fit(X_train, y_train,)
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/neural_network/_multilayer_perceptron.py", line 641, in fit
return self._fit(X, y, incremental=False)
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/neural_network/_multilayer_perceptron.py", line 371, in _fit
intercept_grads, layer_units, incremental)
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/neural_network/_multilayer_perceptron.py", line 554, in _fit_stochastic
self._update_no_improvement_count(early_stopping, X_val, y_val)
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/neural_network/_multilayer_perceptron.py", line 597, in _update_no_improvement_count
self.validation_scores_.append(self.score(X_val, y_val))
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/base.py", line 552, in score
return r2_score(y, y_pred, sample_weight=sample_weight)
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
return f(**kwargs)
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/metrics/_regression.py", line 589, in r2_score
y_true, y_pred, multioutput)
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/metrics/_regression.py", line 86, in _check_reg_targets
y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
return f(**kwargs)
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 645, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "/home/user/Projects/p0/venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 99, in _assert_all_finite
msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

python: 3.7.4
ubuntu: Ubuntu 18.04.5 LTS
packages installed:
alembic==1.4.3 argon2-cffi==20.1.0 async-generator==1.10 attrs==20.2.0 automl-alex==0.10.7 backcall==0.2.0 bleach==3.2.1 catboost==0.24.2 category-encoders==2.2.2 certifi==2020.6.20 cffi==1.14.3 chardet==3.0.4 cliff==3.4.0 cmaes==0.7.0 cmd2==1.3.11 colorama==0.4.4 colorlog==4.4.0 cycler==0.10.0 decorator==4.4.2 defusedxml==0.6.0 entrypoints==0.3 graphviz==0.14.2 idna==2.10 importlib-metadata==2.0.0 ipykernel==5.3.4 ipython==7.19.0 ipython-genutils==0.2.0 jedi==0.17.2 Jinja2==2.11.2 joblib==0.17.0 json5==0.9.5 jsonschema==3.2.0 jupyter-client==6.1.7 jupyter-core==4.6.3 jupyterlab==2.2.9 jupyterlab-pygments==0.1.2 jupyterlab-server==1.2.0 kiwisolver==1.3.1 lightgbm==3.0.0 Mako==1.1.3 MarkupSafe==1.1.1 matplotlib==3.3.2 mistune==0.8.4 nbclient==0.5.1 nbconvert==6.0.7 nbformat==5.0.8 nest-asyncio==1.4.2 notebook==6.1.4 numpy==1.19.4 optuna==2.2.0 packaging==20.4 pandas==1.1.4 pandocfilters==1.4.3 parso==0.7.1 patsy==0.5.1 pbr==5.5.1 pexpect==4.8.0 pickleshare==0.7.5 Pillow==8.0.1 plotly==4.12.0 prettytable==0.7.2 prometheus-client==0.8.0 prompt-toolkit==3.0.8 ptyprocess==0.6.0 pycparser==2.20 Pygments==2.7.2 pyparsing==2.4.7 pyperclip==1.8.1 pyrsistent==0.17.3 python-dateutil==2.8.1 python-editor==1.0.4 pytz==2020.4 PyYAML==5.3.1 pyzmq==19.0.2 requests==2.24.0 retrying==1.3.3 scikit-learn==0.23.2 scipy==1.5.3 seaborn==0.11.0 Send2Trash==1.5.0 six==1.15.0 SQLAlchemy==1.3.20 statsmodels==0.12.1 stevedore==3.2.2 terminado==0.9.1 testpath==0.4.4 threadpoolctl==2.1.0 tornado==6.1 tqdm==4.51.0 traitlets==5.0.5 urllib3==1.25.11 wcwidth==0.2.5 webencodings==0.5.1 xgboost==1.2.1 zipp==3.4.0
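The RuntimeWarnings above show the MLP's weights overflowing to infinity inside a matmul; sklearn's input validation then rejects the resulting non-finite values. The failure chain can be reproduced in isolation (a sketch, not sklearn's actual code path):

```python
import numpy as np

# Multiplying two huge float64 values overflows to inf, as in the warning
# "overflow encountered in matmul" above.
with np.errstate(over="ignore"):
    ret = np.array([[1e200]]) @ np.array([[1e200]])  # 1e400 -> inf

def check_finite(X):
    """Simplified stand-in for sklearn's finiteness check."""
    if not np.isfinite(X).all():
        raise ValueError(
            "Input contains NaN, infinity or a value too large for dtype('float64')."
        )

try:
    check_finite(ret)
except ValueError as err:
    error_message = str(err)
```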

Does it work on spectral data?

Hello,

I'm wondering if you have tested it on spectral data. I have NIR spectroscopy data with 125 variables and currently use TPOT, which works very well, but looking at your benchmark it seems your approach works best for the kind of data it was developed for. Do you recommend testing it on my spectral data, or is that outside its purpose?

thanks
