As of now, the pipeline performs explainability per inference row. Explainability over the whole inference dataset may be useful in the future. Some related code has already been written:
from typing import Dict, Optional

import joblib
import lime.lime_tabular
import pandas as pd


def create_xai_pipeline_classification_per_inference_dataset(
    training_set: pd.DataFrame,
    target: str,
    inference_set: pd.DataFrame,
    type_of_task: str,
    load_from_path: Optional[str] = None,
) -> Dict[str, Dict[str, float]]:
    xai_dataset = training_set.drop(columns=[target])
    explainability_report = {}

    # Mapping dict used later to translate the explainer's feature indices
    # back to feature names.
    mapping_dict = dict(enumerate(xai_dataset.columns))

    # Explainability for both classification tasks.
    # We have to revisit this in the future: when we load the model from the
    # file system we don't care whether it is binary or multiclass.
    # NOTE: LIME's classification mode needs class probabilities; the
    # multiclass pipeline is expected to expose them through predict,
    # the binary one through predict_proba.
    if type_of_task == 'multiclass_classification':
        # Give the option of retrieving a locally stored model.
        if load_from_path is not None:
            model = joblib.load('{}/lgb_multi.pkl'.format(load_from_path))
        else:
            model, evaluation = create_multiclass_classification_training_model_pipeline(training_set, target)
        predict_fn = model.predict
    elif type_of_task == 'binary_classification':
        # Give the option of retrieving a locally stored model.
        if load_from_path is not None:
            model = joblib.load('{}/lgb_binary.pkl'.format(load_from_path))
        else:
            model, evaluation = create_binary_classification_training_model_pipeline(training_set, target)
        predict_fn = model.predict_proba
    else:
        raise ValueError("Unsupported type_of_task: {}".format(type_of_task))

    explainer = lime.lime_tabular.LimeTabularExplainer(
        xai_dataset.values,
        feature_names=xai_dataset.columns.tolist(),
        mode="classification",
        random_state=1,
    )
    for inference_row in range(len(inference_set)):
        exp = explainer.explain_instance(inference_set.values[inference_row], predict_fn)
        med_report = exp.as_map()
        # as_map() keys feature contributions by column index; translate the
        # indices to feature names before storing them in the report.
        temp_dict = dict(list(med_report.values())[0])
        map_dict = {mapping_dict[idx]: val for idx, val in temp_dict.items()}
        explainability_report["row{}".format(inference_row)] = map_dict

    return explainability_report
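
For the dataset-level explainability mentioned above, one option would be to aggregate the per-row maps, e.g. by mean absolute contribution per feature. A minimal sketch, assuming the report format returned by the function above; aggregate_explainability_report is a hypothetical helper, not part of the pipeline:

from collections import defaultdict
from typing import Dict


def aggregate_explainability_report(
    explainability_report: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    # Average the absolute LIME contribution of each feature across all
    # inference rows, giving a rough dataset-level importance ranking.
    totals: Dict[str, float] = defaultdict(float)
    for row_report in explainability_report.values():
        for feature, contribution in row_report.items():
            totals[feature] += abs(contribution)
    n_rows = len(explainability_report)
    # Sort descending so the most influential features come first.
    return {
        feature: total / n_rows
        for feature, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    }

Taking the absolute value before averaging keeps positive and negative contributions from cancelling out; whether signed averages are more appropriate depends on how the report is consumed.
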
def test_create_xai_pipeline_classification_per_inference_dataset(self):
    binary_class_report = create_xai_pipeline_classification_per_inference_dataset(
        df_binary, "target", df_binary_inference, "binary_classification"
    )
    multi_class_report = create_xai_pipeline_classification_per_inference_dataset(
        df_multi, "target", df_multi_inference, "multiclass_classification"
    )

    binary_contribution_check_one = binary_class_report["row0"]["worst perimeter"]
    binary_contribution_check_two = binary_class_report["row2"]["worst texture"]
    multi_contribution_check_one = multi_class_report["row0"]["hue"]
    multi_contribution_check_two = multi_class_report["row9"]["proanthocyanins"]

    # One report entry per inference row.
    assert len(binary_class_report) == len(df_binary_inference)
    assert len(multi_class_report) == len(df_multi_inference)
    # Spot-check individual feature contributions; random_state=1 makes
    # the LIME explanations deterministic.
    assert round(binary_contribution_check_one, 3) == 0.253
    assert round(binary_contribution_check_two, 2) == -0.09
    assert round(multi_contribution_check_one, 2) == -0.08
    assert round(multi_contribution_check_two, 3) == -0.023
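
The test fixtures (df_binary, df_binary_inference, df_multi, df_multi_inference) are not shown in this section. Judging by the asserted feature names, they appear to come from sklearn's breast-cancer and wine datasets; a plausible construction is sketched below, though the exact split (and therefore the asserted contribution values) belongs to the actual test suite:

import pandas as pd
from sklearn.datasets import load_breast_cancer, load_wine

# Hypothetical fixture construction: hold out the last 10 rows of each
# dataset as the inference set, dropping the target column there.
_binary = load_breast_cancer(as_frame=True).frame  # includes a "target" column
df_binary = _binary.iloc[:-10]
df_binary_inference = _binary.iloc[-10:].drop(columns=["target"])

_multi = load_wine(as_frame=True).frame
df_multi = _multi.iloc[:-10]
df_multi_inference = _multi.iloc[-10:].drop(columns=["target"])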