
thinkbiganalytics / AoaDemoModels


This repository contains the example / demo models for Teradata AnalyticOps (ModelOps). The goal of these examples is to provide a simple reference implementation for users and not to provide a detailed data science example for each use case. We provide both Python and R examples along with Classification and Regression model types.

License: BSD 3-Clause "New" or "Revised" License

Python 5.14% Jupyter Notebook 92.00% R 2.87%

aoademomodels's People

Contributors

christian-td · dartov · ks2907 · nunomachado · um255003 · willfleury


aoademomodels's Issues

Add MLE Models Project

The demo MLE models which were here should be moved into a dedicated MLE project.

  • Request creation of a new repo, AoaMleDemoModels.
  • Add the models to it.
  • Request that the new project then be added to the demo-data.

Add a regression model

  • A demand forecasting example would be ideal, but other ideas that don't involve time series are also welcome.
  • Ideally with at least one categorical input; not required, but ideal from a monitoring-testing perspective.
  • Something simple, like what we have for classification (the demo diabetes model). The goal is to show the workflow, not a complex data science problem.
  • The data should obviously live in Teradata (our demo environment).
  • Python code.
  • Should be convertible to PMML (the algorithm/library used must be supported by one of the PMML converters, such as sklearn2pmml or nyoka).
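A minimal sketch of the kind of model the bullets above describe: a simple sklearn regression pipeline with one categorical input, built only from estimators the common PMML converters (sklearn2pmml, nyoka) support. The data, column names, and target are invented placeholders, not the real demo dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy demand data: one categorical feature and one numeric feature.
df = pd.DataFrame({
    "region": ["north", "south", "north", "east", "south", "east"],
    "units_last_week": [10.0, 12.0, 9.0, 20.0, 11.0, 18.0],
    "demand": [11.0, 13.0, 10.0, 21.0, 12.0, 19.0],
})

# One-hot encode the categorical column, pass the numeric column through,
# then fit a plain linear regression - all PMML-convertible components.
pipeline = Pipeline([
    ("prep", ColumnTransformer([
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ], remainder="passthrough")),
    ("model", LinearRegression()),
])
pipeline.fit(df[["region", "units_last_week"]], df["demand"])
preds = pipeline.predict(df[["region", "units_last_week"]])
```

Keeping the preprocessing inside the pipeline is what makes the model convertible as a single unit by sklearn2pmml or nyoka.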

Update python models to specify database

In teradataml, if no database is specified, the user's default database is used, but it should always be specified explicitly to avoid issues. It can be specified on create_context and also when accessing tables,
e.g. on create_context:

import os
from teradataml.context.context import create_context

create_context(host=os.environ["AOA_CONN_HOST"],
               username=os.environ["AOA_CONN_USERNAME"],
               password=os.environ["AOA_CONN_PASSWORD"],
               database=data_conf["database"])

or when creating a DataFrame from a table vs. creating and populating a new table:

from teradataml.dataframe.dataframe import DataFrame, in_schema
from teradataml.dataframe.copy_to import copy_to_sql

train_df = DataFrame(in_schema(data_conf["database"], data_conf["table"]))
copy_to_sql(df, table_name=data_conf["table"], schema_name=data_conf["database"], if_exists='append', index=False)

Be careful: these examples use three different parameter names for the same concept (database in create_context, the schema argument of in_schema, and schema_name in copy_to_sql).

Add R example model with RShiny app that uses it

The idea is to have an example model together with an RShiny app in the model code. We will provide additional Deployment Engine logic to deploy it as a proper Shiny app; at the moment we are looking for a relatable example of a model and an app that uses it.

Update pyspark model to PIMA and use S3

We should update the pyspark model to follow more common pyspark usage. The vast majority of Spark jobs today do not use MLlib, so our demo model should not either. Instead it should show the standard XGBoost PIMA model, changed to read the dataset via S3.

  • update to do PIMA with xgboost
  • read the PIMA dataset from S3 (given we now always have S3, we should create a new bucket for datasets)

Convert and Store PMML for Python XGBoost Diabetes Model

As a machine learning engineer, I want to store a PMML representation of my XGBoost model so that I can deploy it in-database

Subtasks

  • Add the xgboost-to-PMML dependency to the base training docker images to avoid installing it during demos
  • Add code to convert model to PMML in training.py
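A sketch of what the training.py subtask could look like: fit the diabetes classifier, wrapping it in a sklearn2pmml PMMLPipeline when that package is available and falling back to a plain sklearn pipeline otherwise. The actual export call needs a JVM, so it is shown as a comment; the toy data, feature count, and output path are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

try:
    from xgboost import XGBClassifier
    estimator = XGBClassifier(n_estimators=10)
except ImportError:
    # stand-in estimator when xgboost is not installed
    estimator = GradientBoostingClassifier(n_estimators=10)

try:
    from sklearn2pmml.pipeline import PMMLPipeline
    pipeline = PMMLPipeline([("classifier", estimator)])
except ImportError:
    from sklearn.pipeline import Pipeline
    pipeline = Pipeline([("classifier", estimator)])

rng = np.random.default_rng(42)
X = rng.random((20, 8))      # 8 PIMA features, values invented
y = np.array([0, 1] * 10)    # toy diabetes labels
pipeline.fit(X, y)
preds = pipeline.predict(X)

# With sklearn2pmml installed (and a JVM available), export is one call:
#   from sklearn2pmml import sklearn2pmml
#   sklearn2pmml(pipeline, "model.pmml")
```

The resulting model.pmml is what would then be stored as a model artifact for in-database scoring.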

Add support for handling unseen categories for VAL OHE

By default, VAL's one hot encode (dummy version) only creates features/columns for the categories provided at the time of encoder definition. Any undefined or unseen categories are simply ignored without raising an exception; such rows end up with zeros in all defined category columns. This can lead to unknown or undesired effects depending on the importance of the feature and on model assumptions.
One way to handle such situations is to filter out all records with undefined categories. We show how to easily do that in the model code.
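A pandas sketch of the filtering approach described above: drop any rows whose category was not in the encoder definition before encoding, so no all-zero rows are produced. The column names and category values are invented.

```python
import pandas as pd

known_categories = ["A", "B", "C"]
df = pd.DataFrame({"cat": ["A", "B", "D", "C"], "x": [1, 2, 3, 4]})

# "D" was not in the encoder definition; a fixed-category encoder would
# silently encode it as all zeros, so drop such rows before encoding instead.
filtered = df[df["cat"].isin(known_categories)]
encoded = pd.get_dummies(pd.Categorical(filtered["cat"], categories=known_categories))
```

Pinning the categories via pd.Categorical also guarantees the encoded columns are identical between training and scoring, even if a category is absent from one batch.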

Update R model to use tdplyr

Once the library is available in the base images (see the related issue), update the R demo to also use Teradata for the data. This should follow the same pattern as the Python models.

R base dependencies incorrectly installed

The base image we are using for R, willfleury/r_base:3.4, does not have the dependencies installed correctly. This means everything is installed during demos, which is painfully slow for R libraries like gbm.

Minor fix required to the base image Dockerfile

Add data_stats.json to the BYOM PIMA folder

Since the new evaluation.py records data_stats, imported (BYOM) models need to include this file.

from evaluation.py
stats.record_evaluation_stats(DataFrame(data_conf["table"]), DataFrame(predictions_table))

Fix Vantage model with .2 release of teradataml

With the .2+ release of teradataml, it no longer wraps mixed-case column names in quotes when it passes them to the SQL engine. For example, it used to be

NumericInputs('"NumTimesPrg"')

but now is

NumericInputs('NumTimesPrg')

This means that if your table has mixed-case column names, the engine won't find them when it goes to score. To solve this, we should do what is best practice anyway and avoid mixed-case column names.
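One simple way to follow that best practice is to normalize column names before the data ever reaches Vantage, so the generated SQL never depends on quoted identifiers. A minimal pandas sketch:

```python
import pandas as pd

# Mixed-case PIMA column names as used in the demo model.
df = pd.DataFrame({"NumTimesPrg": [6, 1], "PlGlcConc": [148, 85]})

# Lower-case every column before writing the table to Vantage.
df.columns = [c.lower() for c in df.columns]
```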

Improve pyspark model repository support

pyspark models won't follow the same pattern as simple python models, in that they cannot store the MLlib model in the local model folder of the driver. By its nature it needs distributed storage, usually HDFS and sometimes S3 or similar.

Depending on the model artefact store chosen, the framework may need to tell the spark job where to temporarily store the model files on HDFS (or similar) when training finishes, so it can later upload them to the correct storage. A similar effort is required for scoring, as we need to make the model files available somewhere for spark to read.

The approach we are working with so far is for the framework to add a property to the spark submit which allows the base path to be configured per project via the project.metadata; the framework will always append the relevant model version (trainedModel.id) to avoid conflicts.

spark.conf.get("spark.aoa.modelPath")

Change code in evaluation.py for BYOM

We should record stats only if the data_stats file is available. Since this file is not mandatory in the UI, users can get errors during the evaluation process. Change to the BYOM evaluation.py:

import os

# only record evaluation stats when data_stats.json was packaged with the model
if os.path.exists("data_stats.json"):
    stats.record_evaluation_stats(DataFrame(data_conf["table"]), DataFrame(predictions_table))

Add Demo App Portal for Diabetes Check

To make the demos better, we should add a sample website for the PIMA diabetes RESTful model serving that calls it and scores a patient to see if they have diabetes.
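A hypothetical sketch of the request such a portal would send to the model-serving endpoint. The payload layout and helper are illustrative only, not the real serving API; only the feature names come from the PIMA dataset used in the demos.

```python
import json

# PIMA feature names as used in the demo model; order matters for scoring.
PIMA_FEATURES = ["NumTimesPrg", "PlGlcConc", "BloodP", "SkinThick",
                 "TwoHourSerIns", "BMI", "DiPedFunc", "Age"]

def build_score_request(patient):
    """Serialize one patient's feature values for a scoring call.

    `patient` is a dict mapping feature name to value; the return value is
    the JSON body the portal would POST to the serving endpoint.
    """
    return json.dumps({"data": [[patient[f] for f in PIMA_FEATURES]]})
```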

Need to remove __init__.py

Need to remove the __init__.py file from AoaDemoModels/model_templates/, as it causes the repo-cli.py script to fail.

Update PIMA Diabetes XGBoost Demo Model to use S3 for datasets

Currently the XGBoost demo model for diabetes, which we use in almost all our demos, reads the dataset from the internet via a URL. Pandas now also supports reading from S3 via pd.read_csv. This would be really powerful: since we include an embedded S3 via S3Proxy (and also minio with docker compose), we should be able to include the necessary files as part of the startup process for the demo containers.

For example, we should have a demo-data bucket which has the diabetes training and evaluation datasets in there (already split). We should also be able to write the results of scoring back into this under a different folder so we can show e.g. the results of airflow execution!
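A sketch of reading the split datasets from such a bucket. The bucket and file names are assumptions; pd.read_csv handles s3:// URLs when s3fs is installed, and storage_options can point it at the embedded minio/S3Proxy endpoint instead of real AWS S3.

```python
import pandas as pd

def load_dataset(name, bucket="demo-data", endpoint=None):
    """Read e.g. 'pima_train.csv' or 'pima_eval.csv' from the datasets bucket.

    Pass `endpoint` (e.g. "http://minio:9000") when targeting the embedded
    S3-compatible store rather than AWS.
    """
    options = {"client_kwargs": {"endpoint_url": endpoint}} if endpoint else None
    return pd.read_csv(f"s3://{bucket}/{name}", storage_options=options)
```

Scoring results could be written back with the matching DataFrame.to_csv("s3://…", storage_options=…) call to a different folder in the same bucket.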

@nunomachado fyi this is what i was talking about - no rush on it - just an fyi

Cleanup Demo Models

  • Move MLE models to separate project/repo
  • Update Demo Models to use Vantage
  • Update Tensorflow Example

Add local explainability implementation

Add local explainability example for Python XGBoost using the following package

https://pypi.org/project/alibi/

Unlike shap, this allows producing the explainer once (at training time) and then saving (serializing) it, so it can be loaded later and used to explain individual predictions.

Add the explain endpoint in the scoring.py->ModelScorer.explain()
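A sketch of the train-once/serialize pattern described above, degrading to a no-op when alibi is not installed. The artifact path, predict_fn wiring, and use of pickle are assumptions; newer alibi versions also ship their own save/load helpers, which may be preferable for some explainers.

```python
import pickle

try:
    from alibi.explainers import AnchorTabular
    HAVE_ALIBI = True
except ImportError:
    HAVE_ALIBI = False

def train_explainer(predict_fn, X_train, feature_names, path="explainer.pkl"):
    """Fit the explainer once at training time and serialize it so a
    scoring-time explain() method can load it for individual predictions."""
    if not HAVE_ALIBI:
        return None
    explainer = AnchorTabular(predict_fn, feature_names)
    explainer.fit(X_train)
    with open(path, "wb") as f:
        pickle.dump(explainer, f)
    return explainer
```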
