
alpha-automl's Introduction


Alpha-AutoML is an AutoML system that automatically searches for models and derives end-to-end pipelines that read and pre-process the data and train the model. Alpha-AutoML leverages recent advances in deep reinforcement learning and is able to adapt to different application domains and problems through incremental learning.

Alpha-AutoML provides data scientists and data engineers the flexibility to address complex problems by leveraging the Python ecosystem, including open-source libraries and tools, support for collaboration, and infrastructure that enables transparency and reproducibility.

This repository is part of New York University's implementation of the Data Driven Discovery project (D3M).

Documentation

Documentation is available here.

Installation

This package works with Python 3.6+ on Linux, macOS, and Windows.

You can install the latest stable version of this library from PyPI:

pip install alpha-automl

To install the latest development version:

pip install git+https://github.com/VIDA-NYU/alpha-automl@devel

Docker

Pre-built Docker Image

We provide pre-built Docker images with Jupyter and Alpha-AutoML pre-installed that you can use to quickly test Alpha-AutoML. To try one, run the following command on your machine and open Jupyter Notebook in your browser:

docker run -p 8888:8888 ghcr.io/vida-nyu/alpha-automl

With this command, Jupyter Notebook auto-generates a security token. The correct URL to access Jupyter is printed in the console output and looks like: http://127.0.0.1:8888/?token=70ace7fa017c35ba0134dc7931add12bf55a69d4d4e6e54f.

Alternatively, if you want to provide a custom security token, you can run:

docker run -p 8888:8888 -e JUPYTER_TOKEN="<my-token>" ghcr.io/vida-nyu/alpha-automl

If you are running the Jupyter Notebook in a secure environment, the authentication can be disabled as follows:

docker run -p 8888:8888 ghcr.io/vida-nyu/alpha-automl --NotebookApp.token=''

Docker Image From Scratch

If you need to build an image from source, you can use our Dockerfile. Use a docker-build argument to select the packages that will be installed in the image (e.g., full, timeseries, nlp, etc.) as follows:

docker build -t alpha-automl --build-arg BUILD_OPTION=full .

Or build only the base version, which uses less disk space but does not support some tasks such as NLP and time series:

docker build -t alpha-automl:latest --target alpha-automl .

You can also build an image to use with JupyterHub as follows:

docker build -t alpha-automl:latest-jupyterhub --target alpha-automl-jupyterhub .

See also the documentation on how to set up Alpha-AutoML + JupyterHub on Kubernetes.

Others

Documentation for the Streamlit app for image triage developed by Jataware Corp is available here. See also this video demo.

alpha-automl's People

Contributors

aecio, edenwuyifan, laibamehnaz, madhuripujari95, remram44, roquelopez


alpha-automl's Issues

Sampled data having 1 offset, makes it unable to locate the datetime column

I'm training my model using a sampled pandas dataset. The datetime is in column 0. However, the algorithm keeps locating it at column 1. This is not fixed by running X_train = X_train.reset_index(drop=True).

The dataset looks like this: [two screenshots, not available]

INFO:alpha_automl.pipeline_synthesis.pipeline_builder:New pipelined created:
Pipeline(steps=[('sklearn.compose.ColumnTransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('sklearn.preprocessing.OrdinalEncoder',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  [1])])),
                ('sklearn.preprocessing.RobustScaler', RobustScaler()),
                ('sklearn.feature_selection.SelectKBest', SelectKBest()),
                ('sklearn.neighbors.KNeighborsRegressor',
                 KNeighborsRegressor())])
WARNING:alpha_automl.scorer:Exception scoring a pipeline
WARNING:alpha_automl.scorer:Detailed error:
Traceback (most recent call last):
  File "/ext3/miniconda3/lib/python3.10/site-packages/alpha_automl/scorer.py", line 73, in score_pipeline
    scores = cross_val_score(pipeline, X, y, cv=splitting_strategy, scoring=scoring, error_score='raise')
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 515, in cross_val_score
    cv_results = cross_validate(
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
    results = parallel(
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/ext3/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/ext3/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/ext3/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/ext3/miniconda3/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/ext3/miniconda3/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/ext3/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/ext3/miniconda3/lib/python3.10/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/pipeline.py", line 401, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/pipeline.py", line 359, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/ext3/miniconda3/lib/python3.10/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/base.py", line 881, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/preprocessing/_data.py", line 1516, in fit
    X = self._validate_data(
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/base.py", line 565, in _validate_data
    X = check_array(X, input_name="X", **check_params)
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/utils/validation.py", line 879, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
  File "/ext3/miniconda3/lib/python3.10/site-packages/sklearn/utils/_array_api.py", line 185, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
ValueError: could not convert string to float: '2017-10-13 13:00:00'
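
The final ValueError suggests that the raw timestamp string reached RobustScaler, which expects numeric input. Until the column detection is sorted out, one possible workaround is to expand the timestamp into numeric parts before calling fit. The sketch below uses only the standard library; `expand_timestamp` is a hypothetical helper (not part of alpha-automl), and the format string is an assumption based on the value shown in the error message.

```python
from datetime import datetime

# Assumption: timestamps look like the value in the error message above.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def expand_timestamp(value, fmt=TIMESTAMP_FORMAT):
    """Turn one timestamp string into a dict of numeric features."""
    parsed = datetime.strptime(value, fmt)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,
        "weekday": parsed.weekday(),  # Monday is 0, Sunday is 6
    }


features = expand_timestamp("2017-10-13 13:00:00")
print(features)
```

With a pandas DataFrame, the same idea can be applied to the whole column with pd.to_datetime and the .dt accessors, replacing the string column with the derived numeric columns.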

Import Error

Can't Import Alpha AutoML

from alpha_automl import AutoMLClassifier
import pandas as pd

Output:
ImportError                               Traceback (most recent call last)
Cell In[1], line 1
----> 1 from alpha_automl import AutoMLClassifier
2 import pandas as pd

File ~\miniconda3\envs\alpha\Lib\site-packages\alpha_automl\__init__.py:3
1 __version__ = '0.2.0'
----> 3 from .automl_api import AutoMLClassifier, AutoMLRegressor

File ~\miniconda3\envs\alpha\Lib\site-packages\alpha_automl\automl_api.py:7
5 import pandas as pd
6 from sklearn.preprocessing import LabelEncoder
----> 7 from alpha_automl.automl_manager import AutoMLManager
8 from alpha_automl.scorer import make_scorer, make_splitter, make_str_metric
9 from alpha_automl.utils import make_d3m_pipelines

File ~\miniconda3\envs\alpha\Lib\site-packages\alpha_automl\automl_manager.py:6
4 from multiprocessing import set_start_method
5 from alpha_automl.utils import sample_dataset, is_equal_splitting
----> 6 from alpha_automl.scorer import make_splitter, score_pipeline
7 from alpha_automl.pipeline_synthesis.setup_search import search_pipelines as search_pipelines_proc
10 USE_AUTOMATIC_GRAMMAR = False

File ~\miniconda3\envs\alpha\Lib\site-packages\alpha_automl\scorer.py:4
2 import datetime
3 import numpy as np
----> 4 from sklearn.metrics import SCORERS, get_scorer, make_scorer as make_scorer_sk
5 from sklearn.model_selection import BaseCrossValidator, KFold, ShuffleSplit, train_test_split, cross_val_score
6 from sklearn.model_selection._split import BaseShuffleSplit, _RepeatedSplits

ImportError: cannot import name 'SCORERS' from 'sklearn.metrics' (C:\Users\USER\miniconda3\envs\alpha\Lib\site-packages\sklearn\metrics\__init__.py)
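
The failing import refers to sklearn.metrics.SCORERS, which scikit-learn deprecated in version 1.1 and removed in 1.3 in favor of get_scorer_names(). The environment above therefore probably has a scikit-learn release newer than alpha-automl 0.2.0 expects. The sketch below is one way to check this from Python; `major_minor`, `scorers_dict_available`, and `installed_sklearn_version` are hypothetical helper names, not part of either library.

```python
import re
from importlib import metadata


def major_minor(version):
    """Return the leading (major, minor) pair of a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def scorers_dict_available(sklearn_version):
    """True if this scikit-learn version should still expose metrics.SCORERS."""
    # Assumption stated in the text above: SCORERS was removed in 1.3.
    return major_minor(sklearn_version) < (1, 3)


def installed_sklearn_version():
    """Return the installed scikit-learn version, or None if it is missing."""
    try:
        return metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return None


version = installed_sklearn_version()
if version is None:
    print("scikit-learn is not installed")
elif scorers_dict_available(version):
    print(f"scikit-learn {version}: SCORERS should be importable")
else:
    print(f"scikit-learn {version}: SCORERS was removed; use get_scorer_names()")
```

If the check reports that SCORERS is gone, the likely options are to install a scikit-learn release older than 1.3 in that environment or to upgrade alpha-automl to a version that no longer imports it.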

Execution stops after creating Board when using MyEmbedder.

When running python adding_new_primitives_huggingface.py, the execution stops here:

DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-uncased/resolve/main/vocab.txt HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-uncased/resolve/main/config.json HTTP/1.1" 200 0
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/ext3/miniconda3/lib/python3.10/site-packages/sklearn/preprocessing/_label.py:116: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
INFO:datamart_profiler.core:Setting column names from header
INFO:datamart_profiler.core:Identifying types, 4 columns...
INFO:datamart_profiler.core:Processing column 0 'text'...
INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Text]
INFO:datamart_profiler.core:Processing column 1 'Time of Tweet'...
INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Enumeration]
INFO:datamart_profiler.core:Processing column 2 'Age of User'...
INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Enumeration]
INFO:datamart_profiler.core:Processing column 3 'Country'...
INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Enumeration]
INFO:alpha_automl.data_profiler:Results of profiling data: non-numeric features = dict_keys(['TEXT_ENCODER', 'CATEGORICAL_ENCODER']), useless columns = [], missing values = True
INFO:alpha_automl.utils:Sampling down data from 27481 to 2000
INFO:alpha_automl.pipeline_synthesis.setup_search:Creating a manual grammar
INFO:alpha_automl.primitive_loader:Hierarchy of all primitives loaded
INFO:alpha_automl.grammar_loader:Creating task grammar for task CLASSIFICATION_TASK
INFO:alpha_automl.grammar_loader:Task grammar: Grammar with 31 productions (start state = S)
    S -> IMPUTATION ENCODERS FEATURE_SCALING FEATURE_SELECTION CLASSIFICATION
    ENCODERS -> TEXT_ENCODER CATEGORICAL_ENCODER
    IMPUTATION -> 'sklearn.impute.SimpleImputer'
    FEATURE_SCALING -> 'sklearn.preprocessing.MaxAbsScaler'
    FEATURE_SCALING -> 'sklearn.preprocessing.RobustScaler'
    FEATURE_SCALING -> 'sklearn.preprocessing.StandardScaler'
    FEATURE_SCALING -> 'E'
    FEATURE_SELECTION -> 'sklearn.feature_selection.GenericUnivariateSelect'
    FEATURE_SELECTION -> 'sklearn.feature_selection.SelectPercentile'
    FEATURE_SELECTION -> 'sklearn.feature_selection.SelectKBest'
    FEATURE_SELECTION -> 'E'
    TEXT_ENCODER -> 'sklearn.feature_extraction.text.CountVectorizer'
    TEXT_ENCODER -> 'sklearn.feature_extraction.text.TfidfVectorizer'
    TEXT_ENCODER -> 'my_module.MyEmbedder'
    CATEGORICAL_ENCODER -> 'sklearn.preprocessing.OneHotEncoder'
    CLASSIFICATION -> 'sklearn.discriminant_analysis.LinearDiscriminantAnalysis'
    CLASSIFICATION -> 'sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis'
    CLASSIFICATION -> 'sklearn.ensemble.BaggingClassifier'
    CLASSIFICATION -> 'sklearn.ensemble.ExtraTreesClassifier'
    CLASSIFICATION -> 'sklearn.ensemble.GradientBoostingClassifier'
    CLASSIFICATION -> 'sklearn.ensemble.RandomForestClassifier'
    CLASSIFICATION -> 'sklearn.naive_bayes.BernoulliNB'
    CLASSIFICATION -> 'sklearn.naive_bayes.GaussianNB'
    CLASSIFICATION -> 'sklearn.naive_bayes.MultinomialNB'
    CLASSIFICATION -> 'sklearn.neighbors.KNeighborsClassifier'
    CLASSIFICATION -> 'sklearn.linear_model.LogisticRegression'
    CLASSIFICATION -> 'sklearn.linear_model.PassiveAggressiveClassifier'
    CLASSIFICATION -> 'sklearn.linear_model.SGDClassifier'
    CLASSIFICATION -> 'sklearn.svm.LinearSVC'
    CLASSIFICATION -> 'sklearn.svm.SVC'
    CLASSIFICATION -> 'sklearn.tree.DecisionTreeClassifier'
INFO:alpha_automl.grammar_loader:Creating game grammar
INFO:alpha_automl.pipeline_search.Coach:------ITER 1------
INFO:alpha_automl.pipeline_search.MCTS:MCTS SIMULATION 1
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: S
/scratch/lm4428/d3m_latest/alpha-automl/alpha_automl/pipeline_search/pipeline/NNet.py:103: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  board = Variable(board, volatile=True)
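
The task grammar printed in the log can be read as a set of rewrite rules: S expands into a fixed sequence of slots, and each slot is replaced by exactly one primitive. The sketch below enumerates every pipeline derivable from a reduced copy of that grammar, to show how 'my_module.MyEmbedder' enters the search space. It is an exhaustive enumeration for illustration only and not the MCTS search that alpha-automl runs; treating 'E' as "skip this step" is an assumption.

```python
from itertools import product

# Reduced copy of the grammar printed in the log; most alternatives are omitted.
GRAMMAR = {
    "S": [["IMPUTATION", "ENCODERS", "FEATURE_SCALING", "FEATURE_SELECTION", "CLASSIFICATION"]],
    "ENCODERS": [["TEXT_ENCODER", "CATEGORICAL_ENCODER"]],
    "IMPUTATION": [["sklearn.impute.SimpleImputer"]],
    "FEATURE_SCALING": [["sklearn.preprocessing.StandardScaler"], ["E"]],
    "FEATURE_SELECTION": [["sklearn.feature_selection.SelectKBest"], ["E"]],
    "TEXT_ENCODER": [
        ["sklearn.feature_extraction.text.TfidfVectorizer"],
        ["my_module.MyEmbedder"],
    ],
    "CATEGORICAL_ENCODER": [["sklearn.preprocessing.OneHotEncoder"]],
    "CLASSIFICATION": [["sklearn.svm.SVC"], ["sklearn.tree.DecisionTreeClassifier"]],
}


def expand(symbol):
    """Return every list of terminals that `symbol` can derive."""
    if symbol not in GRAMMAR:  # terminal: a primitive name or 'E'
        return [[symbol]]
    results = []
    for production in GRAMMAR[symbol]:
        # Expand each symbol of the production, then combine the alternatives.
        expanded = [expand(part) for part in production]
        for combination in product(*expanded):
            results.append([t for chunk in combination for t in chunk])
    return results


# Assumption: 'E' means "skip this step", so it is dropped from the pipeline.
pipelines = [[step for step in p if step != "E"] for p in expand("S")]
print(len(pipelines))
for pipeline in pipelines[:2]:
    print(" -> ".join(pipeline))
```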

Output of automl.fit() with verbose=False is too verbose

The output of automl.fit(X_train, y_train) is too verbose even when using verbose=False. Some of the output seems to be coming from primitives that we have no control over (see e.g. the output below taken from the semi_supervised_classification_example notebook). The flag verbose=False should imply that alpha-automl only shows human-readable progress output produced by alpha-automl. Maybe consider only showing a progress bar using tqdm-notebook.

automl = AutoMLSemiSupervisedClassifier(output_path, time_bound=10, verbose=False,
                                        split_strategy_kwargs={'test_size':.2})
automl.fit(X_train, y_train)
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:04, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.9136242208370436
INFO:gluonts.mx.context:Using CPU
INFO:gluonts.mx.context:Using CPU
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:11, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.778272484416741
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:11, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.7800534283170079
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:19, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.8076580587711487
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:40, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.9082813891362422
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:02, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.8806767586821014
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:03, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.9091718610863758
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:03, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.8263579697239537
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:06, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=1.0
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:08, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.9839715048975958
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:08, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.7880676758682101
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:11, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=1.0
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:35, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.9982190560997328
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:37, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.918967052537845
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:37, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.8076580587711487
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:41, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.7880676758682101
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:41, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.9973285841495992
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:41, scoring...
[LightGBM] [Info] Number of positive: 102, number of negative: 393
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.094171 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1634
[LightGBM] [Info] Number of data points in the train set: 495, number of used features: 22
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.206061 -> initscore=-1.348837
INFO:alpha_automl.automl_api:Scored pipeline, score=1.0
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:50, scoring...
[LightGBM] [Info] Start training from score -1.348837
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
INFO:alpha_automl.automl_api:Scored pipeline, score=0.9848619768477294
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:50, scoring...
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

[LightGBM] [Warning] No further splits with positive gain, best gain: -inf[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[...]
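
Until verbose=False is handled inside the library, a caller can reduce part of this noise with the standard logging module. The sketch below raises the threshold of the loggers whose names appear as prefixes in the output above (alpha_automl, gluonts); that those libraries log under exactly these names is an assumption based on the prefixes, and `quiet_loggers` is a hypothetical helper.

```python
import logging

# Logger names taken from the prefixes in the output above.
NOISY_LOGGERS = ("gluonts", "alpha_automl")


def quiet_loggers(names=NOISY_LOGGERS, level=logging.WARNING):
    """Raise the threshold of the named loggers and return their old levels."""
    previous = {}
    for name in names:
        logger = logging.getLogger(name)
        previous[name] = logger.level
        logger.setLevel(level)
    return previous


previous_levels = quiet_loggers()
# Child loggers inherit the threshold, so their INFO records are now dropped.
print(logging.getLogger("alpha_automl.automl_api").isEnabledFor(logging.INFO))
```

The [LightGBM] lines are not emitted through these loggers, so this does not affect them; they would probably have to be silenced on the estimator itself (LightGBM accepts a verbosity parameter). The sketch also stops working if the library resets logger levels during fit.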

Show the group of the optional dependencies to install it

When a user has installed the default version of alpha-automl and then tries to use it for NLP tasks, it should show something like this:

WARNING:root:Missing optional dependency "fasttext". Use pip install alpha-automl[nlp] to install it.
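
A minimal sketch of how such a message could be produced, assuming a lookup table from optional module to install group. The table `OPTIONAL_DEPENDENCIES` and the helper `missing_dependency_message` are hypothetical and only illustrate the request; the actual grouping of extras would have to come from the package's setup configuration.

```python
import importlib.util
import logging

# Hypothetical mapping from optional module to the pip extra that provides it.
OPTIONAL_DEPENDENCIES = {"fasttext": "nlp"}


def missing_dependency_message(module_name, extras=OPTIONAL_DEPENDENCIES):
    """Return a warning message if `module_name` is not installed, else None."""
    if importlib.util.find_spec(module_name) is not None:
        return None
    extra = extras.get(module_name)
    hint = f"alpha-automl[{extra}]" if extra else module_name
    return (
        f'Missing optional dependency "{module_name}". '
        f"Use pip install {hint} to install it."
    )


message = missing_dependency_message("fasttext")
if message is not None:
    logging.warning(message)
```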

Memory overflow when running score_pipeline

score, start_time, end_time = score_pipeline(pipeline.get_pipeline(), self.X, self.y, self.scoring,

I found a memory overflow issue when running score_pipeline with the whole dataset in automl_manager.py. This will likely crash the whole process when working with a large dataset. I encountered this issue in AutoML Cup phase 1 (98,000 rows of data). Here's my solution so far: 23c9ba4

This problem also occurs when testing the Hugging Face text classification example locally with the Docker image.
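
One generic way to bound memory during scoring is to score on a fixed-size random sample of rows instead of the full dataset. The sketch below only selects the row indices; it does not reproduce the approach of the commit referenced above, and `sample_row_indices` and the limit of 2,000 rows are assumptions chosen for illustration.

```python
import random


def sample_row_indices(n_rows, max_rows, seed=0):
    """Return sorted row indices, at most `max_rows` of them, chosen uniformly."""
    if n_rows <= max_rows:
        return list(range(n_rows))
    rng = random.Random(seed)  # local generator keeps the sample reproducible
    return sorted(rng.sample(range(n_rows), max_rows))


indices = sample_row_indices(n_rows=98_000, max_rows=2_000)
print(len(indices))
```

With pandas inputs, the selected rows could then be taken with X.iloc[indices] and y.iloc[indices] before they are passed to the scoring function.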

Nested pipelines cannot visualize properly

I am implementing a semi-supervised learning pipeline where one task pipeline is as follows:
S -> IMPUTER ENCODERS SEMISUPERVISED_CLASSIFIER CLASSIFIER
In this task, the SEMISUPERVISED_CLASSIFIER takes CLASSIFIER as an input parameter. The following shows one of the examples generated by pipeline_search:
[Screenshot from 2023-05-26 13-36-18, not available]
However, when I run plot_comparison_pipelines, it raises an error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 automl.plot_comparison_pipelines()

File ~/alpha_automl_eden/alpha-automl/alpha_automl/automl_api.py:256, in BaseAutoML.plot_comparison_pipelines(self, precomputed_pipelines, precomputed_primitive_types)
    254     sign = get_sign_sorting(self.scorer._score_func, self.score_sorting)
    255     pipelines, primitive_types = make_d3m_pipelines(self.pipelines, self.new_primitives, self.metric, sign)
--> 256     plot_comparison_pipelines(pipelines, primitive_types)
    257 else:
    258     plot_comparison_pipelines(precomputed_pipelines, precomputed_primitive_types)

File ~/alpha_automl_eden/alpha-automl/alpha_automl/visualization.py:7, in plot_comparison_pipelines(pipelines, primitive_types)
      5 def plot_comparison_pipelines(pipelines, primitive_types=None):
      6     import PipelineProfiler
----> 7     PipelineProfiler.plot_pipeline_matrix(pipelines, primitive_types)

File ~/alpha_automl_eden/lib/python3.10/site-packages/PipelineProfiler/_plot_pipeline_matrix.py:194, in plot_pipeline_matrix(pipelines, manual_primitive_types)
    192 id = id_generator()
    193 data_dict = prepare_data_pipeline_matrix(pipelines, manual_primitive_types)
--> 194 html_all = make_html(data_dict, id)
    195 display(HTML(html_all))

File ~/alpha_automl_eden/lib/python3.10/site-packages/PipelineProfiler/_plot_pipeline_matrix.py:66, in make_html(data_dict, id)
     49 lib_path = pkg_resources.resource_filename(__name__, "build/pipelineVis.js")
     50 bundle = open(lib_path, "r", encoding="utf8").read()
     51 html_all = """
     52 <html>
     53 <head>
     54 </head>
     55 <body>
     56     <script>
     57     {bundle}
     58     </script>
     59     <div id="{id}">
     60     </div>
     61     <script>
     62         pipelineVis.renderPipelineMatrixBundle("#{id}", {data_dict});
     63     </script>
     64 </body>
     65 </html>
---> 66 """.format(bundle=bundle, id=id, data_dict=json.dumps(data_dict))
     67 return html_all

File /usr/lib/python3.10/json/__init__.py:231, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    226 # cached encoder
    227 if (not skipkeys and ensure_ascii and
    228     check_circular and allow_nan and
    229     cls is None and indent is None and separators is None and
    230     default is None and not sort_keys and not kw):
--> 231     return _default_encoder.encode(obj)
    232 if cls is None:
    233     cls = JSONEncoder

File /usr/lib/python3.10/json/encoder.py:199, in JSONEncoder.encode(self, o)
    195         return encode_basestring(o)
    196 # This doesn't pass the iterator directly to ''.join() because the
    197 # exceptions aren't as detailed.  The list call should be roughly
    198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
    201     chunks = list(chunks)

File /usr/lib/python3.10/json/encoder.py:257, in JSONEncoder.iterencode(self, o, _one_shot)
    252 else:
    253     _iterencode = _make_iterencode(
    254         markers, self.default, _encoder, self.indent, floatstr,
    255         self.key_separator, self.item_separator, self.sort_keys,
    256         self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)

File /usr/lib/python3.10/json/encoder.py:179, in JSONEncoder.default(self, o)
    160 def default(self, o):
    161     """Implement this method in a subclass such that it returns
    162     a serializable object for ``o``, or calls the base implementation
    163     (to raise a ``TypeError``).
   (...)
    177 
    178     """
--> 179     raise TypeError(f'Object of type {o.__class__.__name__} '
    180                     f'is not JSON serializable')

TypeError: Object of type LGBMClassifier is not JSON serializable
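
The traceback shows json.dumps reaching an estimator object nested inside the pipeline description. One possible way around this is to give json.dumps a `default` callable that converts unknown objects to strings. In the sketch below, `PlaceholderEstimator` stands in for LGBMClassifier and `to_jsonable` is a hypothetical helper; neither is part of alpha-automl or PipelineProfiler.

```python
import json


class PlaceholderEstimator:
    """Stand-in for an estimator object such as LGBMClassifier."""

    def __repr__(self):
        return "PlaceholderEstimator()"


def to_jsonable(obj):
    """Fallback for json.dumps: represent unknown objects by their repr."""
    return repr(obj)


hyperparams = {
    "base_estimator": {"type": "VALUE", "data": PlaceholderEstimator()},
    "threshold": {"type": "VALUE", "data": 0.75},
}

# Without default=..., this call raises the TypeError shown above.
encoded = json.dumps(hyperparams, default=to_jsonable)
print(encoded)
```

Because PipelineProfiler calls json.dumps internally, the conversion would probably have to be applied to the pipeline dictionaries before they are handed over, for example by round-tripping them through json.loads(json.dumps(..., default=to_jsonable)).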

The output of make_d3m_pipelines(automl.pipelines, automl.new_primitives, automl.metric, -1) is attached, and seems alright to me:

([{'pipeline_id': 'Pipeline #1',
   'inputs': [{'name': 'input dataset'}],
   'steps': [{'primitive': {'python_path': 'alpha_automl.primitives.compose.ColumnTransformer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'inputs.0'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.0.produce'},
     'hyperparams': {}},
    {'primitive': {'python_path': 'alpha_automl.primitives.text.CountVectorizer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.0.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.1.produce'},
     'hyperparams': {'input': {'type': 'VALUE', 'data': 'content'},
      'encoding': {'type': 'VALUE', 'data': 'utf-8'},
      'decode_error': {'type': 'VALUE', 'data': 'strict'},
      'strip_accents': {'type': 'VALUE', 'data': None},
      'preprocessor': {'type': 'VALUE', 'data': None},
      'tokenizer': {'type': 'VALUE', 'data': None},
      'analyzer': {'type': 'VALUE', 'data': 'word'},
      'lowercase': {'type': 'VALUE', 'data': True},
      'token_pattern': {'type': 'VALUE', 'data': '(?u)\\b\\w\\w+\\b'},
      'stop_words': {'type': 'VALUE', 'data': None},
      'max_df': {'type': 'VALUE', 'data': 1.0},
      'min_df': {'type': 'VALUE', 'data': 1},
      'max_features': {'type': 'VALUE', 'data': None},
      'ngram_range': {'type': 'VALUE', 'data': (1, 1)},
      'vocabulary': {'type': 'VALUE', 'data': None},
      'binary': {'type': 'VALUE', 'data': False},
      'dtype': {'type': 'VALUE', 'data': 'int64'}}},
    {'primitive': {'python_path': 'alpha_automl.primitives.builtin_primitives.SkLabelSpreading',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.1.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.2.produce'},
     'hyperparams': {}}],
   'outputs': [{'data': 'steps.2.produce'}],
   'pipeline_digest': 'Pipeline #1',
   'start': '2023-05-25T22:19:36.858848Z',
   'end': '2023-05-25T22:19:42.409235Z',
   'scores': [{'metric': {'metric': 'accuracy_score'},
     'value': 0.8018867924528302,
     'normalized': -0.8018867924528302}],
   'pipeline_source': {'name': 'Pipeline'}},
  {'pipeline_id': 'Pipeline #2',
   'inputs': [{'name': 'input dataset'}],
   'steps': [{'primitive': {'python_path': 'alpha_automl.primitives.compose.ColumnTransformer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'inputs.0'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.0.produce'},
     'hyperparams': {}},
    {'primitive': {'python_path': 'alpha_automl.primitives.text.CountVectorizer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.0.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.1.produce'},
     'hyperparams': {'input': {'type': 'VALUE', 'data': 'content'},
      'encoding': {'type': 'VALUE', 'data': 'utf-8'},
      'decode_error': {'type': 'VALUE', 'data': 'strict'},
      'strip_accents': {'type': 'VALUE', 'data': None},
      'preprocessor': {'type': 'VALUE', 'data': None},
      'tokenizer': {'type': 'VALUE', 'data': None},
      'analyzer': {'type': 'VALUE', 'data': 'word'},
      'lowercase': {'type': 'VALUE', 'data': True},
      'token_pattern': {'type': 'VALUE', 'data': '(?u)\\b\\w\\w+\\b'},
      'stop_words': {'type': 'VALUE', 'data': None},
      'max_df': {'type': 'VALUE', 'data': 1.0},
      'min_df': {'type': 'VALUE', 'data': 1},
      'max_features': {'type': 'VALUE', 'data': None},
      'ngram_range': {'type': 'VALUE', 'data': (1, 1)},
      'vocabulary': {'type': 'VALUE', 'data': None},
      'binary': {'type': 'VALUE', 'data': False},
      'dtype': {'type': 'VALUE', 'data': 'int64'}}},
    {'primitive': {'python_path': 'alpha_automl.primitives.builtin_primitives.SkLabelPropagation',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.1.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.2.produce'},
     'hyperparams': {}}],
   'outputs': [{'data': 'steps.2.produce'}],
   'pipeline_digest': 'Pipeline #2',
   'start': '2023-05-25T22:19:47.747631Z',
   'end': '2023-05-25T22:19:53.377893Z',
   'scores': [{'metric': {'metric': 'accuracy_score'},
     'value': 0.8018867924528302,
     'normalized': -0.8018867924528302}],
   'pipeline_source': {'name': 'Pipeline'}},
  {'pipeline_id': 'Pipeline #3',
   'inputs': [{'name': 'input dataset'}],
   'steps': [{'primitive': {'python_path': 'alpha_automl.primitives.compose.ColumnTransformer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'inputs.0'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.0.produce'},
     'hyperparams': {}},
    {'primitive': {'python_path': 'alpha_automl.primitives.text.TfidfVectorizer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.0.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.1.produce'},
     'hyperparams': {'input': {'type': 'VALUE', 'data': 'content'},
      'encoding': {'type': 'VALUE', 'data': 'utf-8'},
      'decode_error': {'type': 'VALUE', 'data': 'strict'},
      'strip_accents': {'type': 'VALUE', 'data': None},
      'preprocessor': {'type': 'VALUE', 'data': None},
      'tokenizer': {'type': 'VALUE', 'data': None},
      'analyzer': {'type': 'VALUE', 'data': 'word'},
      'lowercase': {'type': 'VALUE', 'data': True},
      'token_pattern': {'type': 'VALUE', 'data': '(?u)\\b\\w\\w+\\b'},
      'stop_words': {'type': 'VALUE', 'data': None},
      'max_df': {'type': 'VALUE', 'data': 1.0},
      'min_df': {'type': 'VALUE', 'data': 1},
      'max_features': {'type': 'VALUE', 'data': None},
      'ngram_range': {'type': 'VALUE', 'data': (1, 1)},
      'vocabulary': {'type': 'VALUE', 'data': None},
      'binary': {'type': 'VALUE', 'data': False},
      'dtype': {'type': 'VALUE', 'data': 'float64'},
      'norm': {'type': 'VALUE', 'data': 'l2'},
      'use_idf': {'type': 'VALUE', 'data': True},
      'smooth_idf': {'type': 'VALUE', 'data': True},
      'sublinear_tf': {'type': 'VALUE', 'data': False}}},
    {'primitive': {'python_path': 'alpha_automl.primitives.semi_supervised.SelfTrainingClassifier',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.1.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.2.produce'},
     'hyperparams': {'base_estimator': {'type': 'VALUE',
       'data': LGBMClassifier()},
      'threshold': {'type': 'VALUE', 'data': 0.75},
      'criterion': {'type': 'VALUE', 'data': 'threshold'},
      'k_best': {'type': 'VALUE', 'data': 10},
      'max_iter': {'type': 'VALUE', 'data': 10},
      'verbose': {'type': 'VALUE', 'data': False}}}],
   'outputs': [{'data': 'steps.2.produce'}],
   'pipeline_digest': 'Pipeline #3',
   'start': '2023-05-25T22:20:01.617305Z',
   'end': '2023-05-25T22:20:07.179369Z',
   'scores': [{'metric': {'metric': 'accuracy_score'},
     'value': 0.7962264150943397,
     'normalized': -0.7962264150943397}],
   'pipeline_source': {'name': 'Pipeline'}},
  {'pipeline_id': 'Pipeline #4',
   'inputs': [{'name': 'input dataset'}],
   'steps': [{'primitive': {'python_path': 'alpha_automl.primitives.compose.ColumnTransformer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'inputs.0'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.0.produce'},
     'hyperparams': {}},
    {'primitive': {'python_path': 'alpha_automl.primitives.text.TfidfVectorizer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.0.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.1.produce'},
     'hyperparams': {'input': {'type': 'VALUE', 'data': 'content'},
      'encoding': {'type': 'VALUE', 'data': 'utf-8'},
      'decode_error': {'type': 'VALUE', 'data': 'strict'},
      'strip_accents': {'type': 'VALUE', 'data': None},
      'preprocessor': {'type': 'VALUE', 'data': None},
      'tokenizer': {'type': 'VALUE', 'data': None},
      'analyzer': {'type': 'VALUE', 'data': 'word'},
      'lowercase': {'type': 'VALUE', 'data': True},
      'token_pattern': {'type': 'VALUE', 'data': '(?u)\\b\\w\\w+\\b'},
      'stop_words': {'type': 'VALUE', 'data': None},
      'max_df': {'type': 'VALUE', 'data': 1.0},
      'min_df': {'type': 'VALUE', 'data': 1},
      'max_features': {'type': 'VALUE', 'data': None},
      'ngram_range': {'type': 'VALUE', 'data': (1, 1)},
      'vocabulary': {'type': 'VALUE', 'data': None},
      'binary': {'type': 'VALUE', 'data': False},
      'dtype': {'type': 'VALUE', 'data': 'float64'},
      'norm': {'type': 'VALUE', 'data': 'l2'},
      'use_idf': {'type': 'VALUE', 'data': True},
      'smooth_idf': {'type': 'VALUE', 'data': True},
      'sublinear_tf': {'type': 'VALUE', 'data': False}}},
    {'primitive': {'python_path': 'alpha_automl.primitives.semi_supervised.SelfTrainingClassifier',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.1.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.2.produce'},
     'hyperparams': {'base_estimator': {'type': 'VALUE',
       'data': BernoulliNB()},
      'threshold': {'type': 'VALUE', 'data': 0.75},
      'criterion': {'type': 'VALUE', 'data': 'threshold'},
      'k_best': {'type': 'VALUE', 'data': 10},
      'max_iter': {'type': 'VALUE', 'data': 10},
      'verbose': {'type': 'VALUE', 'data': False}}}],
   'outputs': [{'data': 'steps.2.produce'}],
   'pipeline_digest': 'Pipeline #4',
   'start': '2023-05-25T22:20:07.869250Z',
   'end': '2023-05-25T22:20:08.547936Z',
   'scores': [{'metric': {'metric': 'accuracy_score'},
     'value': 0.7886792452830189,
     'normalized': -0.7886792452830189}],
   'pipeline_source': {'name': 'Pipeline'}},
  {'pipeline_id': 'Pipeline #5',
   'inputs': [{'name': 'input dataset'}],
   'steps': [{'primitive': {'python_path': 'alpha_automl.primitives.compose.ColumnTransformer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'inputs.0'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.0.produce'},
     'hyperparams': {}},
    {'primitive': {'python_path': 'alpha_automl.primitives.text.CountVectorizer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.0.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.1.produce'},
     'hyperparams': {'input': {'type': 'VALUE', 'data': 'content'},
      'encoding': {'type': 'VALUE', 'data': 'utf-8'},
      'decode_error': {'type': 'VALUE', 'data': 'strict'},
      'strip_accents': {'type': 'VALUE', 'data': None},
      'preprocessor': {'type': 'VALUE', 'data': None},
      'tokenizer': {'type': 'VALUE', 'data': None},
      'analyzer': {'type': 'VALUE', 'data': 'word'},
      'lowercase': {'type': 'VALUE', 'data': True},
      'token_pattern': {'type': 'VALUE', 'data': '(?u)\\b\\w\\w+\\b'},
      'stop_words': {'type': 'VALUE', 'data': None},
      'max_df': {'type': 'VALUE', 'data': 1.0},
      'min_df': {'type': 'VALUE', 'data': 1},
      'max_features': {'type': 'VALUE', 'data': None},
      'ngram_range': {'type': 'VALUE', 'data': (1, 1)},
      'vocabulary': {'type': 'VALUE', 'data': None},
      'binary': {'type': 'VALUE', 'data': False},
      'dtype': {'type': 'VALUE', 'data': 'int64'}}},
    {'primitive': {'python_path': 'alpha_automl.primitives.semi_supervised.SelfTrainingClassifier',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.1.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.2.produce'},
     'hyperparams': {'base_estimator': {'type': 'VALUE',
       'data': BernoulliNB()},
      'threshold': {'type': 'VALUE', 'data': 0.75},
      'criterion': {'type': 'VALUE', 'data': 'threshold'},
      'k_best': {'type': 'VALUE', 'data': 10},
      'max_iter': {'type': 'VALUE', 'data': 10},
      'verbose': {'type': 'VALUE', 'data': False}}}],
   'outputs': [{'data': 'steps.2.produce'}],
   'pipeline_digest': 'Pipeline #5',
   'start': '2023-05-25T22:20:09.338940Z',
   'end': '2023-05-25T22:20:09.984734Z',
   'scores': [{'metric': {'metric': 'accuracy_score'},
     'value': 0.7886792452830189,
     'normalized': -0.7886792452830189}],
   'pipeline_source': {'name': 'Pipeline'}},
  {'pipeline_id': 'Pipeline #6',
   'inputs': [{'name': 'input dataset'}],
   'steps': [{'primitive': {'python_path': 'alpha_automl.primitives.compose.ColumnTransformer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'inputs.0'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.0.produce'},
     'hyperparams': {}},
    {'primitive': {'python_path': 'alpha_automl.primitives.text.CountVectorizer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.0.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.1.produce'},
     'hyperparams': {'input': {'type': 'VALUE', 'data': 'content'},
      'encoding': {'type': 'VALUE', 'data': 'utf-8'},
      'decode_error': {'type': 'VALUE', 'data': 'strict'},
      'strip_accents': {'type': 'VALUE', 'data': None},
      'preprocessor': {'type': 'VALUE', 'data': None},
      'tokenizer': {'type': 'VALUE', 'data': None},
      'analyzer': {'type': 'VALUE', 'data': 'word'},
      'lowercase': {'type': 'VALUE', 'data': True},
      'token_pattern': {'type': 'VALUE', 'data': '(?u)\\b\\w\\w+\\b'},
      'stop_words': {'type': 'VALUE', 'data': None},
      'max_df': {'type': 'VALUE', 'data': 1.0},
      'min_df': {'type': 'VALUE', 'data': 1},
      'max_features': {'type': 'VALUE', 'data': None},
      'ngram_range': {'type': 'VALUE', 'data': (1, 1)},
      'vocabulary': {'type': 'VALUE', 'data': None},
      'binary': {'type': 'VALUE', 'data': False},
      'dtype': {'type': 'VALUE', 'data': 'int64'}}},
    {'primitive': {'python_path': 'alpha_automl.primitives.semi_supervised.SelfTrainingClassifier',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.1.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.2.produce'},
     'hyperparams': {'base_estimator': {'type': 'VALUE',
       'data': GradientBoostingClassifier()},
      'threshold': {'type': 'VALUE', 'data': 0.75},
      'criterion': {'type': 'VALUE', 'data': 'threshold'},
      'k_best': {'type': 'VALUE', 'data': 10},
      'max_iter': {'type': 'VALUE', 'data': 10},
      'verbose': {'type': 'VALUE', 'data': False}}}],
   'outputs': [{'data': 'steps.2.produce'}],
   'pipeline_digest': 'Pipeline #6',
   'start': '2023-05-25T22:21:57.988906Z',
   'end': '2023-05-25T22:22:37.808992Z',
   'scores': [{'metric': {'metric': 'accuracy_score'},
     'value': 0.779245283018868,
     'normalized': -0.779245283018868}],
   'pipeline_source': {'name': 'Pipeline'}},
  {'pipeline_id': 'Pipeline #7',
   'inputs': [{'name': 'input dataset'}],
   'steps': [{'primitive': {'python_path': 'alpha_automl.primitives.compose.ColumnTransformer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'inputs.0'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.0.produce'},
     'hyperparams': {}},
    {'primitive': {'python_path': 'alpha_automl.primitives.text.TfidfVectorizer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.0.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.1.produce'},
     'hyperparams': {'input': {'type': 'VALUE', 'data': 'content'},
      'encoding': {'type': 'VALUE', 'data': 'utf-8'},
      'decode_error': {'type': 'VALUE', 'data': 'strict'},
      'strip_accents': {'type': 'VALUE', 'data': None},
      'preprocessor': {'type': 'VALUE', 'data': None},
      'tokenizer': {'type': 'VALUE', 'data': None},
      'analyzer': {'type': 'VALUE', 'data': 'word'},
      'lowercase': {'type': 'VALUE', 'data': True},
      'token_pattern': {'type': 'VALUE', 'data': '(?u)\\b\\w\\w+\\b'},
      'stop_words': {'type': 'VALUE', 'data': None},
      'max_df': {'type': 'VALUE', 'data': 1.0},
      'min_df': {'type': 'VALUE', 'data': 1},
      'max_features': {'type': 'VALUE', 'data': None},
      'ngram_range': {'type': 'VALUE', 'data': (1, 1)},
      'vocabulary': {'type': 'VALUE', 'data': None},
      'binary': {'type': 'VALUE', 'data': False},
      'dtype': {'type': 'VALUE', 'data': 'float64'},
      'norm': {'type': 'VALUE', 'data': 'l2'},
      'use_idf': {'type': 'VALUE', 'data': True},
      'smooth_idf': {'type': 'VALUE', 'data': True},
      'sublinear_tf': {'type': 'VALUE', 'data': False}}},
    {'primitive': {'python_path': 'alpha_automl.primitives.semi_supervised.SelfTrainingClassifier',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.1.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.2.produce'},
     'hyperparams': {'base_estimator': {'type': 'VALUE',
       'data': GradientBoostingClassifier()},
      'threshold': {'type': 'VALUE', 'data': 0.75},
      'criterion': {'type': 'VALUE', 'data': 'threshold'},
      'k_best': {'type': 'VALUE', 'data': 10},
      'max_iter': {'type': 'VALUE', 'data': 10},
      'verbose': {'type': 'VALUE', 'data': False}}}],
   'outputs': [{'data': 'steps.2.produce'}],
   'pipeline_digest': 'Pipeline #7',
   'start': '2023-05-25T22:20:58.555729Z',
   'end': '2023-05-25T22:21:57.984417Z',
   'scores': [{'metric': {'metric': 'accuracy_score'},
     'value': 0.7735849056603774,
     'normalized': -0.7735849056603774}],
   'pipeline_source': {'name': 'Pipeline'}},
  {'pipeline_id': 'Pipeline #8',
   'inputs': [{'name': 'input dataset'}],
   'steps': [{'primitive': {'python_path': 'alpha_automl.primitives.compose.ColumnTransformer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'inputs.0'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.0.produce'},
     'hyperparams': {}},
    {'primitive': {'python_path': 'alpha_automl.primitives.text.TfidfVectorizer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.0.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.1.produce'},
     'hyperparams': {'input': {'type': 'VALUE', 'data': 'content'},
      'encoding': {'type': 'VALUE', 'data': 'utf-8'},
      'decode_error': {'type': 'VALUE', 'data': 'strict'},
      'strip_accents': {'type': 'VALUE', 'data': None},
      'preprocessor': {'type': 'VALUE', 'data': None},
      'tokenizer': {'type': 'VALUE', 'data': None},
      'analyzer': {'type': 'VALUE', 'data': 'word'},
      'lowercase': {'type': 'VALUE', 'data': True},
      'token_pattern': {'type': 'VALUE', 'data': '(?u)\\b\\w\\w+\\b'},
      'stop_words': {'type': 'VALUE', 'data': None},
      'max_df': {'type': 'VALUE', 'data': 1.0},
      'min_df': {'type': 'VALUE', 'data': 1},
      'max_features': {'type': 'VALUE', 'data': None},
      'ngram_range': {'type': 'VALUE', 'data': (1, 1)},
      'vocabulary': {'type': 'VALUE', 'data': None},
      'binary': {'type': 'VALUE', 'data': False},
      'dtype': {'type': 'VALUE', 'data': 'float64'},
      'norm': {'type': 'VALUE', 'data': 'l2'},
      'use_idf': {'type': 'VALUE', 'data': True},
      'smooth_idf': {'type': 'VALUE', 'data': True},
      'sublinear_tf': {'type': 'VALUE', 'data': False}}},
    {'primitive': {'python_path': 'alpha_automl.primitives.builtin_primitives.SkLabelSpreading',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.1.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.2.produce'},
     'hyperparams': {}}],
   'outputs': [{'data': 'steps.2.produce'}],
   'pipeline_digest': 'Pipeline #8',
   'start': '2023-05-25T22:19:42.414144Z',
   'end': '2023-05-25T22:19:47.452415Z',
   'scores': [{'metric': {'metric': 'accuracy_score'},
     'value': 0.7358490566037735,
     'normalized': -0.7358490566037735}],
   'pipeline_source': {'name': 'Pipeline'}},
  {'pipeline_id': 'Pipeline #9',
   'inputs': [{'name': 'input dataset'}],
   'steps': [{'primitive': {'python_path': 'alpha_automl.primitives.compose.ColumnTransformer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'inputs.0'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.0.produce'},
     'hyperparams': {}},
    {'primitive': {'python_path': 'alpha_automl.primitives.text.TfidfVectorizer',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.0.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.1.produce'},
     'hyperparams': {'input': {'type': 'VALUE', 'data': 'content'},
      'encoding': {'type': 'VALUE', 'data': 'utf-8'},
      'decode_error': {'type': 'VALUE', 'data': 'strict'},
      'strip_accents': {'type': 'VALUE', 'data': None},
      'preprocessor': {'type': 'VALUE', 'data': None},
      'tokenizer': {'type': 'VALUE', 'data': None},
      'analyzer': {'type': 'VALUE', 'data': 'word'},
      'lowercase': {'type': 'VALUE', 'data': True},
      'token_pattern': {'type': 'VALUE', 'data': '(?u)\\b\\w\\w+\\b'},
      'stop_words': {'type': 'VALUE', 'data': None},
      'max_df': {'type': 'VALUE', 'data': 1.0},
      'min_df': {'type': 'VALUE', 'data': 1},
      'max_features': {'type': 'VALUE', 'data': None},
      'ngram_range': {'type': 'VALUE', 'data': (1, 1)},
      'vocabulary': {'type': 'VALUE', 'data': None},
      'binary': {'type': 'VALUE', 'data': False},
      'dtype': {'type': 'VALUE', 'data': 'float64'},
      'norm': {'type': 'VALUE', 'data': 'l2'},
      'use_idf': {'type': 'VALUE', 'data': True},
      'smooth_idf': {'type': 'VALUE', 'data': True},
      'sublinear_tf': {'type': 'VALUE', 'data': False}}},
    {'primitive': {'python_path': 'alpha_automl.primitives.builtin_primitives.SkLabelPropagation',
      'name': 'primitive'},
     'arguments': {'input0': {'data': 'steps.1.produce'}},
     'outputs': [{'id': 'produce'}],
     'reference': {'type': 'CONTAINER', 'data': 'steps.2.produce'},
     'hyperparams': {}}],
   'outputs': [{'data': 'steps.2.produce'}],
   'pipeline_digest': 'Pipeline #9',
   'start': '2023-05-25T22:19:53.382010Z',
   'end': '2023-05-25T22:19:58.156724Z',
   'scores': [{'metric': {'metric': 'accuracy_score'},
     'value': 0.7339622641509433,
     'normalized': -0.7339622641509433}],
   'pipeline_source': {'name': 'Pipeline'}}],
 {'alpha_automl.primitives.preprocessing.OneHotEncoder': 'Categorical Encoder',
  'alpha_automl.primitives.discriminant_analysis.LinearDiscriminantAnalysis': 'Classifier',
  'alpha_automl.primitives.discriminant_analysis.QuadraticDiscriminantAnalysis': 'Classifier',
  'alpha_automl.primitives.ensemble.BaggingClassifier': 'Classifier',
  'alpha_automl.primitives.ensemble.ExtraTreesClassifier': 'Classifier',
  'alpha_automl.primitives.ensemble.GradientBoostingClassifier': 'Classifier',
  'alpha_automl.primitives.ensemble.RandomForestClassifier': 'Classifier',
  'alpha_automl.primitives.naive_bayes.BernoulliNB': 'Classifier',
  'alpha_automl.primitives.naive_bayes.GaussianNB': 'Classifier',
  'alpha_automl.primitives.naive_bayes.MultinomialNB': 'Classifier',
  'alpha_automl.primitives.neighbors.KNeighborsClassifier': 'Classifier',
  'alpha_automl.primitives.linear_model.LogisticRegression': 'Classifier',
  'alpha_automl.primitives.linear_model.PassiveAggressiveClassifier': 'Classifier',
  'alpha_automl.primitives.linear_model.SGDClassifier': 'Classifier',
  'alpha_automl.primitives.svm.LinearSVC': 'Classifier',
  'alpha_automl.primitives.svm.SVC': 'Classifier',
  'alpha_automl.primitives.tree.DecisionTreeClassifier': 'Classifier',
  'alpha_automl.primitives.xgboost.XGBClassifier': 'Classifier',
  'alpha_automl.primitives.lightgbm.LGBMClassifier': 'Classifier',
  'alpha_automl.primitives.cluster.KMeans': 'Clusterer',
  'alpha_automl.primitives.cluster.AgglomerativeClustering': 'Clusterer',
  'alpha_automl.primitives.preprocessing.OrdinalEncoder': 'Datetime Encoder',
  'alpha_automl.primitives.builtin_primitives.CyclicalFeature': 'Datetime Encoder',
  'alpha_automl.primitives.builtin_primitives.Datetime64ExpandEncoder': 'Datetime Encoder',
  'alpha_automl.primitives.builtin_primitives.DummyEncoder': 'Datetime Encoder',
  'alpha_automl.primitives.preprocessing.MaxAbsScaler': 'Feature Scaler',
  'alpha_automl.primitives.preprocessing.RobustScaler': 'Feature Scaler',
  'alpha_automl.primitives.preprocessing.StandardScaler': 'Feature Scaler',
  'alpha_automl.primitives.feature_selection.GenericUnivariateSelect': 'Feature Selector',
  'alpha_automl.primitives.feature_selection.SelectPercentile': 'Feature Selector',
  'alpha_automl.primitives.feature_selection.SelectKBest': 'Feature Selector',
  'alpha_automl.primitives.impute.SimpleImputer': 'Imputer',
  'alpha_automl.primitives.linear_model.ARDRegression': 'Regressor',
  'alpha_automl.primitives.tree.DecisionTreeRegressor': 'Regressor',
  'alpha_automl.primitives.ensemble.ExtraTreesRegressor': 'Regressor',
  'alpha_automl.primitives.gaussian_process.GaussianProcessRegressor': 'Regressor',
  'alpha_automl.primitives.ensemble.GradientBoostingRegressor': 'Regressor',
  'alpha_automl.primitives.neighbors.KNeighborsRegressor': 'Regressor',
  'alpha_automl.primitives.linear_model.Lars': 'Regressor',
  'alpha_automl.primitives.linear_model.Lasso': 'Regressor',
  'alpha_automl.primitives.linear_model.LassoCV': 'Regressor',
  'alpha_automl.primitives.svm.LinearSVR': 'Regressor',
  'alpha_automl.primitives.linear_model.PassiveAggressiveRegressor': 'Regressor',
  'alpha_automl.primitives.ensemble.RandomForestRegressor': 'Regressor',
  'alpha_automl.primitives.linear_model.Ridge': 'Regressor',
  'alpha_automl.primitives.linear_model.SGDRegressor': 'Regressor',
  'alpha_automl.primitives.svm.SVR': 'Regressor',
  'alpha_automl.primitives.linear_model.BayesianRidge': 'Regressor',
  'alpha_automl.primitives.linear_model.ElasticNet': 'Regressor',
  'alpha_automl.primitives.linear_model.HuberRegressor': 'Regressor',
  'alpha_automl.primitives.linear_model.LinearRegression': 'Regressor',
  'alpha_automl.primitives.linear_model.RANSACRegressor': 'Regressor',
  'alpha_automl.primitives.linear_model.RidgeCV': 'Regressor',
  'alpha_automl.primitives.linear_model.TheilSenRegressor': 'Regressor',
  'alpha_automl.primitives.xgboost.XGBRegressor': 'Regressor',
  'alpha_automl.primitives.lightgbm.LGBMRegressor': 'Regressor',
  'alpha_automl.primitives.text.CountVectorizer': 'Text Encoder',
  'alpha_automl.primitives.text.TfidfVectorizer': 'Text Encoder',
  'alpha_automl.primitives.compose.ColumnTransformer': 'Column Transformer',
  'alpha_automl.primitives.semi_supervised.SelfTrainingClassifier': 'Semisupervised Classifier',
  'alpha_automl.primitives.builtin_primitives.SkLabelSpreading': 'Labelpropagation Classifier',
  'alpha_automl.primitives.builtin_primitives.SkLabelPropagation': 'Labelpropagation Classifier'})
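
The dump above is easier to compare once each pipeline is reduced to one row. The following is a minimal sketch that assumes only the structure visible in the output (a list of dicts with 'pipeline_id', 'steps', and 'scores'). The `summarize` helper and the trimmed sample records are illustrative and are not part of Alpha-AutoML.

```python
def summarize(pipelines):
    """Return (pipeline_id, primitive class names, score) rows, best score first."""
    rows = []
    for p in pipelines:
        # Keep only the class name from each dotted python_path.
        names = [s['primitive']['python_path'].rsplit('.', 1)[-1] for s in p['steps']]
        rows.append((p['pipeline_id'], names, p['scores'][0]['value']))
    # Descending order suits accuracy; reverse it for error metrics.
    return sorted(rows, key=lambda r: r[2], reverse=True)


# Hypothetical, trimmed-down records mirroring two entries of the dump.
sample = [
    {'pipeline_id': 'Pipeline #9',
     'steps': [
         {'primitive': {'python_path': 'alpha_automl.primitives.text.TfidfVectorizer'}},
         {'primitive': {'python_path': 'alpha_automl.primitives.builtin_primitives.SkLabelPropagation'}}],
     'scores': [{'metric': {'metric': 'accuracy_score'}, 'value': 0.7339622641509433}]},
    {'pipeline_id': 'Pipeline #3',
     'steps': [
         {'primitive': {'python_path': 'alpha_automl.primitives.text.TfidfVectorizer'}},
         {'primitive': {'python_path': 'alpha_automl.primitives.semi_supervised.SelfTrainingClassifier'}}],
     'scores': [{'metric': {'metric': 'accuracy_score'}, 'value': 0.7962264150943397}]},
]

for pipeline_id, names, score in summarize(sample):
    print(pipeline_id, ' -> '.join(names), round(score, 4))
```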

No such file or directory: 'tmp/nn_models'

I tried to run a pipeline search (from tabular_classification.ipynb) and it failed with the error:

Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/site-packages/alpha_automl/pipeline_synthesis/setup_search.py", line 112, in search_pipelines
    c.learn()
  File "/opt/conda/lib/python3.9/site-packages/alpha_automl/pipeline_search/Coach.py", line 101, in learn
    self.nnet.save_checkpoint(folder=self.args.get('checkpoint'), filename='temp.pth.tar')
  File "/opt/conda/lib/python3.9/site-packages/alpha_automl/pipeline_search/pipeline/NNet.py", line 145, in save_checkpoint
    os.mkdir(folder)
FileNotFoundError: [Errno 2] No such file or directory: 'tmp/nn_models'

To solve it, I had to create the folder tmp manually in the home directory of the JupyterLab installation. Does alpha-automl create this folder automatically?
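
The traceback shows `os.mkdir(folder)` being called with 'tmp/nn_models'. `os.mkdir` creates only the last path component, so it fails when the parent folder tmp does not exist. Below is a minimal sketch of the usual alternative, `os.makedirs` with `exist_ok=True`. The helper name `ensure_checkpoint_dir` is hypothetical, and whether the project adopted this fix is not stated in the thread.

```python
import os
import tempfile


def ensure_checkpoint_dir(folder):
    """Create folder and any missing parents; do nothing if it already exists."""
    os.makedirs(folder, exist_ok=True)
    return folder


# Demonstrate in a scratch directory where 'tmp' does not exist yet.
base = tempfile.mkdtemp()
target = os.path.join(base, 'tmp', 'nn_models')
ensure_checkpoint_dir(target)
ensure_checkpoint_dir(target)  # a second call is a no-op rather than an error
```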

automl.fit crashes when the time_bound is large

I was fitting an AutoML model on my dataset (20,000 rows randomly sampled from yellow_taxi) with time_bound=10 on my kernel (1 core, 64 GiB), and the kernel dies every time.
The screenshot is attached:
image

It runs well when the time_bound is small, so I am filing this as an issue.

Question re: examples/timeseries_forecasting_example.ipynb

In examples/timeseries_forecasting_example.ipynb, the output shows

{'metric': 'mean_squared_error', 'score': 1.3176959065629877e-28}

for test set stock price prediction using the top-ranked pipeline:

ColumnTransformer, OrdinalEncoder, RobustScaler, LinearRegression

... so essentially perfect performance w/ linear regression. Is that right? Predicting stock prices is notoriously hard, so I'm wondering if maybe it's possible that the outcome variable is leaking into the input variables:

X_train = train_data[["Date", "Close"]]
y_train = train_data[["Close"]]

Thanks!
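
The suspicion can be checked without the notebook. The sketch below uses a synthetic random walk, not the notebook's stock data, and a hand-written one-feature least-squares fit. It compares a leaky setup, where the feature is the target itself, with a lagged setup, where the previous value predicts the next one. All names are illustrative.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) * (xi - mx) for xi in x)
    a = cov / var
    return a, my - a * mx


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


random.seed(0)
# Synthetic random-walk "prices" standing in for the Close column.
close = [100.0]
for _ in range(199):
    close.append(close[-1] + random.gauss(0, 1))
train, test = close[:150], close[150:]

# Leaky setup: the feature is the target itself (X contains "Close", y is "Close").
a, b = fit_line(train, train)
leaky_mse = mse(test, [a * c + b for c in test])

# Lagged setup: predict each value from the previous one.
a2, b2 = fit_line(train[:-1], train[1:])
lagged_mse = mse(test[1:], [a2 * c + b2 for c in test[:-1]])

print('leaky MSE:', leaky_mse)
print('lagged MSE:', lagged_mse)
```

In the leaky setup the regression only has to learn the identity function, so an error at the level of floating-point noise is expected whatever the data.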

Error Searching Pipeline on Windows

I got this error when searching for pipelines on Windows:

Process Process-1:
Traceback (most recent call last):
  File "c:\users\yifan\appdata\local\programs\python\python38\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "c:\users\yifan\appdata\local\programs\python\python38\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "c:\users\yifan\yfw215\alpha-automl\alpha_automl\pipeline_synthesis\setup_search.py", line 59, in search_pipelines
    signal.signal(signal.SIGALRM, lambda signum, frame: signal_handler(queue))
AttributeError: module 'signal' has no attribute 'SIGALRM'
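
`signal.SIGALRM` is available only on POSIX systems, which is why the attribute lookup fails on Windows. Below is a minimal sketch of a portable timeout that falls back to `threading.Timer` when `SIGALRM` is missing. The `start_timeout` helper is hypothetical and is not the project's actual fix. The two branches also differ in behavior: the signal handler interrupts the main thread, while the timer callback runs in a separate thread.

```python
import signal
import threading


def start_timeout(seconds, on_timeout):
    """Schedule on_timeout() after `seconds`; return a function that cancels it."""
    in_main_thread = threading.current_thread() is threading.main_thread()
    if hasattr(signal, 'SIGALRM') and in_main_thread:
        # POSIX path: signal handlers can only be installed from the main thread.
        signal.signal(signal.SIGALRM, lambda signum, frame: on_timeout())
        signal.alarm(int(seconds))
        return lambda: signal.alarm(0)
    # Windows (or non-main thread) path: use a timer thread instead.
    timer = threading.Timer(seconds, on_timeout)
    timer.daemon = True
    timer.start()
    return timer.cancel


fired = []
cancel = start_timeout(60, lambda: fired.append(True))
cancel()  # the search finished early, so the pending timeout is dropped
```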

Update scores in tabular_*.ipynb examples

In the tabular_classification.ipynb and tabular_regression.ipynb examples, the scores are

{'metric': 'accuracy_score', 'score': 0.7222222222222222} # tabular_classification
{'metric': 'mean_absolute_error', 'score': 1.774809999999999} # tabular_regression

but if I actually run examples/tabular_{classification,regression}.py, I get

{'metric': 'accuracy_score', 'score': 0.8333333333333334} # tabular_classification
{'metric': 'mean_absolute_error', 'score': 1.7378600000000004} # tabular_regression

So the scores shown in .ipynb are substantially worse than the actual scores the system produces. Are you able to update the .ipynb examples to demonstrate good / optimal system performance?

Thanks!

Export code of the pipeline

The current version allows saving the pipeline as a pickle object for future use. We should be able to export the pipeline's code as well.
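
One possible shape for such an export is to render Python source text from a list of steps. The sketch below assumes each step is given as a fully qualified import path plus a dict of hyperparameters, and that the target is a scikit-learn `Pipeline`. The function `export_pipeline_code` and the step naming are hypothetical and do not describe an existing Alpha-AutoML API.

```python
def export_pipeline_code(steps):
    """Render scikit-learn Pipeline source text from (import_path, hyperparams) pairs."""
    imports = ['from sklearn.pipeline import Pipeline']
    step_lines = []
    for i, (path, params) in enumerate(steps):
        module, cls = path.rsplit('.', 1)
        imports.append(f'from {module} import {cls}')
        # repr() keeps strings quoted and leaves numbers and booleans as literals.
        args = ', '.join(f'{k}={v!r}' for k, v in params.items())
        step_lines.append(f"    ('step{i}', {cls}({args})),")
    return '\n'.join(imports + ['', 'pipeline = Pipeline([', *step_lines, '])'])


code = export_pipeline_code([
    ('sklearn.feature_extraction.text.TfidfVectorizer', {'lowercase': True, 'norm': 'l2'}),
    ('sklearn.naive_bayes.BernoulliNB', {}),
])
print(code)
```

Hyperparameters whose values are objects rather than literals (for example, a nested estimator) would need their own rendering, which this sketch does not attempt.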

New primitives from Hugging Face are slower than built-in text encoder primitives

I have added two language models as new primitives: ELECTRA and RoBERTa (domain-adapted for sentiment analysis).
These models are much slower than the built-in primitives, but they score high when given enough time to train.

For example:

  1. When timebound = 10 and SBATCH --time=00:30:00:
    My primitive is called MyEmbedder() here:

image

  2. When timebound = 40 and SBATCH --time=04:00:00:
    My primitive is called MySentimentEmbedder() here:

image

My sbatch script is:

#!/bin/bash
  
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --time=04:00:00
#SBATCH --mem=128GB
#SBATCH --job-name=torch
#SBATCH --output=4hours.out

module purge

singularity exec \
            --overlay /scratch/lm4428/my_env/overlay-50G-10M.ext3:rw \
            /scratch/work/public/singularity/cuda11.2.2-cudnn8-devel-ubuntu20.04.sif \
            /bin/bash -c "source /ext3/env.sh; python /scratch/lm4428/d3m_latest/alpha-automl/examples/adding_new_primitives_huggingface.py"

Is there a way I can decrease the amount of time it takes for these primitives?

Support for latest Python and core libraries

Need to make sure that we support the latest Python (3.11) and recent versions of core libraries. We need to remove the version restrictions of some libraries and make sure everything works fine:

  • scipy<=1.9.3
  • numpy<=1.24.3

TODO:

  • Upgrade the requirements.txt file to remove restrictions
  • Upgrade Dockerfile to use Python 3.11
  • Make GitHub CI run tests on multiple Python versions
  • Test and fix any bugs found

Review primitive hyperparams that cause errors

Some default hyperparameters cause errors every time they are used. For instance, the 'average' hyperparameter of the Sklearn Imputer will always fail for categorical features (we can't calculate the average for this type of feature). These values should be changed, like here.
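
The imputer case can be sketched in plain Python: pick the fill strategy from the column's type instead of using one default for every column. The strategy names 'mean' and 'most_frequent' match options of scikit-learn's SimpleImputer. The helpers `choose_strategy` and `impute` are hypothetical and only illustrate the idea.

```python
from collections import Counter
from statistics import mean


def choose_strategy(values):
    """Return 'mean' for all-numeric columns, otherwise 'most_frequent'."""
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present)
    return 'mean' if numeric else 'most_frequent'


def impute(values):
    """Fill None entries using the strategy chosen for this column."""
    present = [v for v in values if v is not None]
    if choose_strategy(values) == 'mean':
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


print(impute([1.0, None, 3.0]))
print(impute(['a', None, 'b', 'a']))
```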
