ahnung

Automated Machine Learning Pipeline for MongoDB Collections, written in Python and leveraging AutoSKLearn and Flask.

Requirements
Install
Usage
Examples, Results, Video
References
AutoML Papers

Intro

Ahnung allows data scientists and machine learning specialists to configure a generic, automated ML pipeline against data in any MongoDB collection. The pipeline automatically evaluates, optimizes and visualizes the performance of multiple machine learning models, feature engineering options and hyperparameters. By systematizing and automating these tedious and error-prone tasks, Ahnung accelerates the thorough exploration of machine learning approaches on new datasets.

The Ahnung pipeline generates an ensemble of machine learning models to accomplish the selected task. The last stage of the pipeline provides a visualization of the ensemble similar to the one below.

Ahnung uses documents from the MongoDB collection as training instances (or observations). Document attributes provide the features (or attributes) for supervised machine learning tasks. Due to the dynamic schema nature of MongoDB collections, Ahnung performs a schema analysis to discover the names of the document attributes, their data types and frequencies. The pipeline recognizes categorical and numeric features and provides median or mode substitution for missing values.

You will provide Ahnung with MongoDB connection strings for the read-only, source database and one or more read-write databases used by the various stages to store metadata and machine learning models. Additionally, you will specify the target attribute. The ML pipeline will learn to predict the value of the target attribute from a given document containing the other attributes. In order to implement this task, Ahnung will construct or fit an ensemble of machine learning models to the dataset in the source collection. AutoSKLearn is used to accomplish many necessary tasks, including model selection and hyperparameter optimization.

Ahnung includes four stages: schema, cleanup, model and predict. These stages must be performed in this order to satisfy dependencies from the previous stages. However, it is common to make adjustments to the AutoSKLearn settings, for example, and rerun the model and predict stages without requiring the previous two.

Schema

The schema stage scans every document in the collection to determine the data type(s) and frequency of each attribute. The primary goal is to discover a set of numerical and categorical attributes that are frequently present in the training dataset. An attempt is made to recognize implicit data types encoded in strings (ie. floats and ints) so that those can be automatically converted to binary values compatible with machine learning algorithms.

Cleanup

The cleanup stage scans every document and converts each attribute to the selected data type. This includes converting string values to binary values to align with other instances in the dataset, as discovered in the schema stage. Missing attributes are replaced with the median or mode of the feature, unless there are too many missing from the same instance. If there are too many missing attributes, the instance is discarded.

Model

The model stage loads the normalized documents from the previous stage and converts them into a pandas DataFrame for use with AutoSKLearn. This stage uses AutoSKLearn to search various machine learning models and their hyperparameter configurations. The best of the models is then used to build an ensemble model for predicting the given target value.

Predict

The predict stage provides a REST interface to predict the target attribute value from a JSON document containing the values of the other attributes. Additionally, this stage provides statistics and visualizations of various pipeline stage results. These are intended to provide insight into how it was constructed and how well the ML pipeline was able to generalize from the dataset provided. The following are provided for the ensemble estimator.

ROC AUC, Precision and Recall per label value against held out data
Prediction URL, with example curl command
PipelineProfiler of the ensemble.
Confusion Matrix for classification tasks against held out data
ROC Curve Charts
Feature Importance via Permutation

Requirements

Many of the requirements for using Ahnung stem from the system requirements of AutoSKLearn. Additionally, configuration of the Python environment is most easily managed by Anaconda, so it must also be installed.

Linux Operating System
Python >= 3.5
C++ compiler
SWIG
Anaconda

For what its worth, all of my usage and testing has been on Ubuntu so far.

Install

First, checkout or download the Ahnung project from GitHub. Open a terminal and cd into the project. Run the following Anaconda command to configure the ahnung_env Python environment.

conda env create -f ahnung_env_conda_auto_sklearn.yml

This will take some time and some bandwidth. When it finishes, you should see lines similar to:

Installing collected packages: scikit-learn, liac-arff, ConfigSpace, pynisher, pyrfr, lazy-import, smac, py, iniconfig, more-itertools, pytest, auto-sklearn, networkx, Send2Trash, prometheus-client, terminado, argon2-cffi, notebook, pipelineprofiler
Successfully installed ConfigSpace-0.4.13 Send2Trash-1.5.0 argon2-cffi-20.1.0 auto-sklearn-0.8.0 iniconfig-1.0.1 lazy-import-0.2.2 liac-arff-2.4.0 more-itertools-8.4.0 networkx-2.4 notebook-6.1.3 pipelineprofiler-0.1.15 prometheus-client-0.8.0 py-1.9.0 pynisher-0.5.0 pyrfr-0.8.0 pytest-6.0.1 scikit-learn-0.22.2.post1 smac-0.12.3 terminado-0.8.3

Next, you will need to activate the Python environment so that Ahnung has access to all the required goodies (ahem, libraries).

$ conda activate ahnung_env
(ahnung_env) myuser@my-box:~/ahnung$

Now, you can get the short usage message from run_pipeline.py.

$ ./run_pipeline.py 
Specify the configuration settings JSON file as the first parameter.
USAGE: ./run_pipeline.py <settings.json> [first_stage [second_stage [...]]]
    - Valid stages are: schema, cleanup, model and predict.

Usage

The command line interface to Ahnung is encapsulated in run_pipeline.py.

Specify the configuration settings JSON file as the first parameter.
USAGE: ./run_pipeline.py <settings.json> [first_stage [second_stage [...]]]
    - Valid stages are: schema, cleanup, model and predict.

You can find sample JSON settings files in the examples directory of the project.

The configuration from the settings file is loaded into a Python dictionary. Areas in the file can be broken down based on the dictionary key that contains those settings.

global_properties

Contains the list of one or more estimators, global random seed and global default modeling settings. Each estimator in the est_list array is described by various values.

Setting Name	Expected Type	Default	Description
`target_name`	string	N/A	Name of the document attribute (instance feature) that Ahnung will learn to predict
`src_collname`	string	N/A	Name of the collection containing the source dataset documents
`is_classification`	boolean	'true'	Indicates the machine learning task is classification
`is_regression`	boolean	'false'	Indicates the machine learning task is regression
`allowed_cpus`	integer	"1"	Maximum number of jobs launched by AutoSKLearn
`max_global_time`	integer	"600"	Controls time_left_for_this_task setting to specify the global time allowed for searching for models and model hyperparameters
`max_permodel_time`	integer	"60"	Controls per_run_time_limit setting to specify the time allowed for a single call to fit the data by a single machine learning model
`metric`	string	"accuracy"	The criteria or formula for evaluating the performance of machine learning models during the construction of the ensemble. The relative performance of a model is important during training, hyper-parameter optimization and ensemble construction. Supported values include: `accuracy`, `balanced_accuracy`, `f1_macro`, `f1_micro`, `roc_auc`, `precision_macro`, `precision_micro`, `average_precision`, `recall_macro`, `recall_micro` and `log_loss`. For categorization, only metrics appropriate to multi-class tasks are supported, refer to AutoSKLearn built-in metrics.
`ensemble_size`	integer	"50"	Maximum number of models that may be added to the ensemble generated by AutoSKLearn
`ensemble_nbest`	float or integer	"0.2"	Fraction or number of the best models to drawn from when constructing the ensemble. Refer to the Ensemble Building Process.
`max_models_on_disc`	integer	"50"	Limits the number of machine learning models that can be stored on the filesystem.
`random_seed`	integer	"10001"	Random seed integer value for the machine learning algorithms

schema_properties

This subdocument controls the settings used by Ahnung to analyze the schema of the source collection containing the training and testing dataset input to the machine learning algorithms.

Setting Name	Expected Type	Default	Description
`attr_type_min_present`	float	"0.8"	Minimum fraction (between 0.0 and 1.0 exclusive) of the documents containing an attribute for that attribute to be used during model search/fitting/tuning.
`attr_type_min_typealign`	float	"0.8"	Minimum fraction (between 0.0 and 1.0 exclusive) of the documents containing the same data type for a feature/attribute for that data type to be selected as the normalized type for the feature.
`max_categorical_values`	integer	"10"	The maximum number of unique values allowed for a feature/attribute to be recognized as categorical.

model_properties

This subdocument controls modeling settings not aligned with AutoSKLearn.

Setting Name	Expected Type	Default	Description
`target_category_balancing`	string	"none"	Applies only to classification tasks. Selects the `target` feature/attribute class rebalancing type performed, if any. Selecting "equal" causes all classes to appear with equal frequency in the rebalanced dataset. Selecting "average" causes the native/input frequency to be arithmetically averaged with the equal weighting and the rebalancing adjusts to the averaged frequencies. The point of "average" is to provide a middle ground between the raw frequencies (priors) and full equalization (no priors).
`category_max_oversample`	float	"2.0"	When rebalancing to new class frequencies, this selects the factor of over sampling allowed in less frequent classes to avoid loosing useful information in the data points of more frequent classes. You can disable over sampling completely with a value of "1.0".

service_properties

This subdocument controls the predict stage which provides an HTTP service with REST API.

Setting Name	Expected Type	Default	Description
`service_hostname`	string	"localhost"	The hostname or IP address where the HTTP service should listen for connections and accept requests.
`service_port`	integer	"8088"	The port number on which the HTTP service should listen for connections and accept requests.

connect_uris

This subdocument contains four MongoDB connection strings for accessing the MongoDB database. Note that before accessing the MongoDB database, Ahnung prompts for the username and password from the keyboard. Use the constant usercredsplaceholder to specify the location of the username and password in the MongoDB connection string.

Setting Name	Expected Type	Default	Description
`source_uri`	string	N/A, required	Specifies the collection containing the source dataset for the machine learning pipeline. Read-only access is required to this collection. The documents in this collection must contain the `target_name` attribute in order for the pipeline to model (learn from) the dataset.
`rawdocs_uri`	string	N/A, required	Specifies the collection used to store intermediate documents resulting from the `schema` stage. Read+write access is required to this collection.
`metadata_uri`	string	N/A, required	Specifies the collection used to store metadata or statistics about the dataset, its schema and the AutoSKLearn model search and ensemble build. Also holds the final AutoSKLearn ensemble and individual component models. Read+write access is required to this collection.
`cleaned_uri`	string	N/A, required	Specifies the collection used to store intermediate documents resulting from the `cleanup` stage. Read+write access is required to this collection.

Examples

Sample Video

Introduction to Ahnung Video

Sample Results

You can find static examples of the charts, statistics and ensemble construction results in the links below. Note that the GitHub project is not running the prediction stage of Ahnung. The prediction URL for each result will not work until you install and run Ahnung on your own datasets.

Dataset Name	Results	Dataset Source
Early Stage Diabetes Risk Prediction (UCI)	Ahnung Results	Early stage diabetes risk prediction dataset.
Adult Income (UCI)	Ahnung Results	Adult Data Set
Wine (UCI)	Ahnung Results	Wine Data Set
Iris (UCI)	Ahnung Results	Iris Data Set

Sample JSON Configuration

The following JSON configuration file contains settings for Ahnung to learn a categorization task against the well-known iris and wine datasets available from the UCI Machine Learning Repository. The datasets were downloaded, converted from CSV to JSON and loaded into the Atlas cluster referenced under connect_uris. Refer to datasets/install_datasets.sh for an example script.

{
    "global_properties": {
        "num_folds": "5",
        "random_seed": "1000001",
        "allowed_cpus": "1",
        "max_global_time": "360",
        "max_permodel_time": "36",
        "est_list": [
            {
                "src_collname": "iris",
                "target_name": "class",
                "is_classification": "true",
                "is_regression": "false",
                "allowed_cpus": "2",
                "max_global_time": "240",
                "max_permodel_time": "24",
                "random_seed": "3000003"
            },  
            {   
                "src_collname": "wine",
                "target_name": "class",
                "is_classification": "true",
                "is_regression": "false",
                "allowed_cpus": "2",
                "max_global_time": "240",
                "max_permodel_time": "24",
                "random_seed": "5000005"
            }   
        ]
    },
    "schema_properties": {
        "attr_type_min_present": "0.8",
        "attr_type_min_typealign": "0.8",
        "max_categorical_values": "20"
    },
    "model_properties": {
        "target_category_balancing": "none",
        "category_max_oversample": "2.0"
    },  
    "service_properties": {
        "service_hostname": "localhost",
        "service_port": "8088"
    },  
    "connect_uris": {
        "source_uri": "mongodb+srv://[email protected]/UCI-ML?authSource=admin&w=majority",
        "rawdocs_uri": "mongodb+srv://[email protected]/rawdocs?authSource=admin&w=majority",
        "metadata_uri": "mongodb+srv://[email protected]/metadata?authSource=admin&w=majority",
        "cleaned_uri": "mongodb+srv://[email protected]/cleaned?authSource=admin&w=majority"
    }
}

References

Ahnung leans heavily on a large number of Python based projects, including the following.

AutoSKLearn: GitHub and Documentation
Flask Web Framework: GitHub and Website
PipelineProfiler: PyPi.org and Blog Post

wbrianblevins / ahnung Goto Github PK

ahnung's Introduction