
thinkbiganalytics / AoaDemoModels


This repository contains the example / demo models for Teradata AnalyticOps (ModelOps). The goal of these examples is to provide a simple reference implementation for users and not to provide a detailed data science example for each use case. We provide both Python and R examples along with Classification and Regression model types.

License: BSD 3-Clause "New" or "Revised" License

Python 5.14% Jupyter Notebook 92.00% R 2.87%

aoademomodels's People

Contributors

christian-td · dartov · ks2907 · nunomachado · um255003 · willfleury


aoademomodels's Issues

Add MLE Models Project

The demo MLE models which were here should be moved into a dedicated MLE project.

  • Request creation of a new repo, AoaMleDemoModels.
  • Add the models to it.
  • Request that the new project then be added to the demo-data.

Add a regression model

  • A demand forecasting example would be ideal, but other ideas that don't involve time series are also welcome.
  • Ideally with at least one categorical input; not required, but ideal from a monitoring-testing perspective.
  • Something simple, like what we have for classification (the demo diabetes model). The goal is to show the workflow, not a complex data science problem.
  • The data should obviously live in Teradata (our demo environment).
  • Python code.
  • Should be convertible to PMML (the algorithm/library used must be supported by one of the PMML converters, such as sklearn2pmml or nyoka).
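A minimal sketch of the kind of model the bullets above describe: a simple sklearn regression pipeline with one categorical input, built only from estimators the common PMML converters (sklearn2pmml, nyoka) support. The data, column names, and target are invented placeholders, not the real demo dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy demand data: one categorical feature and one numeric feature.
df = pd.DataFrame({
    "region": ["north", "south", "north", "east", "south", "east"],
    "units_last_week": [10.0, 12.0, 9.0, 20.0, 11.0, 18.0],
    "demand": [11.0, 13.0, 10.0, 21.0, 12.0, 19.0],
})

# One-hot encode the categorical column, pass the numeric column through,
# then fit a plain linear regression - all PMML-convertible components.
pipeline = Pipeline([
    ("prep", ColumnTransformer([
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ], remainder="passthrough")),
    ("model", LinearRegression()),
])
pipeline.fit(df[["region", "units_last_week"]], df["demand"])
preds = pipeline.predict(df[["region", "units_last_week"]])
```

Keeping the preprocessing inside the pipeline is what makes the model convertible as a single unit by sklearn2pmml or nyoka.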

Update python models to specify database

In teradataml, if no database is specified, the user's default database is used, but it should always be specified explicitly to avoid issues. It can be specified on create_context and also when accessing tables,
e.g. on create_context:

import os
from teradataml.context.context import create_context

create_context(host=os.environ["AOA_CONN_HOST"],
               username=os.environ["AOA_CONN_USERNAME"],
               password=os.environ["AOA_CONN_PASSWORD"],
               database=data_conf["database"])

or when creating a DataFrame from a table vs. creating and populating a new table:

from teradataml.dataframe.dataframe import DataFrame, in_schema
from teradataml.dataframe.copy_to import copy_to_sql

train_df = DataFrame(in_schema(data_conf["database"], data_conf["table"]))
copy_to_sql(df, table_name=data_conf["table"], schema_name=data_conf["database"], if_exists='append', index=False)

Be careful: these examples use three different parameter names for the same concept (database in create_context, the schema argument of in_schema, and schema_name in copy_to_sql).

Add R example model with RShiny app that uses it

The idea is to have an example model together with an RShiny app in the model code. We will provide additional Deployment Engine logic to deploy it as a proper Shiny app; at the moment we are looking for a relatable example of a model and an app that uses it.

Update pyspark model to PIMA and use S3

We should update the pyspark model to follow more common pyspark usage. The vast majority of Spark jobs today do not use MLlib, so our demo model should not either. Instead it should show the standard XGBoost PIMA model, changed to read the dataset via S3.

  • update to do PIMA with xgboost
  • read the PIMA dataset from S3 (given we now always have S3, we should create a new bucket for datasets)

Convert and Store PMML for Python XGBoost Diabetes Model

As a machine learning engineer, I want to store a PMML representation of my XGBoost model so that I can deploy it in-database

Subtasks

  • Add the xgboost-to-PMML dependency to the base training docker images to avoid installing it during demos
  • Add code to convert model to PMML in training.py
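A sketch of what the training.py subtask could look like: fit the diabetes classifier, wrapping it in a sklearn2pmml PMMLPipeline when that package is available and falling back to a plain sklearn pipeline otherwise. The actual export call needs a JVM, so it is shown as a comment; the toy data, feature count, and output path are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

try:
    from xgboost import XGBClassifier
    estimator = XGBClassifier(n_estimators=10)
except ImportError:
    # stand-in estimator when xgboost is not installed
    estimator = GradientBoostingClassifier(n_estimators=10)

try:
    from sklearn2pmml.pipeline import PMMLPipeline
    pipeline = PMMLPipeline([("classifier", estimator)])
except ImportError:
    from sklearn.pipeline import Pipeline
    pipeline = Pipeline([("classifier", estimator)])

rng = np.random.default_rng(42)
X = rng.random((20, 8))      # 8 PIMA features, values invented
y = np.array([0, 1] * 10)    # toy diabetes labels
pipeline.fit(X, y)
preds = pipeline.predict(X)

# With sklearn2pmml installed (and a JVM available), export is one call:
#   from sklearn2pmml import sklearn2pmml
#   sklearn2pmml(pipeline, "model.pmml")
```

The resulting model.pmml is what would then be stored as a model artifact for in-database scoring.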

Add support for handling unseen categories for VAL OHE

By default, VAL's one hot encode (dummy version) only creates features/columns for the categories provided at the time of encoder definition. Any undefined or unseen categories are simply ignored without raising an exception; such rows end up with zeros in all defined category columns. This can lead to unknown or undesired effects depending on the importance of the feature and on model assumptions.
One way to handle such situations is to filter out all records with undefined categories. We show how to easily do that in the model code.
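A pandas sketch of the filtering approach described above: drop any rows whose category was not in the encoder definition before encoding, so no all-zero rows are produced. The column names and category values are invented.

```python
import pandas as pd

known_categories = ["A", "B", "C"]
df = pd.DataFrame({"cat": ["A", "B", "D", "C"], "x": [1, 2, 3, 4]})

# "D" was not in the encoder definition; a fixed-category encoder would
# silently encode it as all zeros, so drop such rows before encoding instead.
filtered = df[df["cat"].isin(known_categories)]
encoded = pd.get_dummies(pd.Categorical(filtered["cat"], categories=known_categories))
```

Pinning the categories via pd.Categorical also guarantees the encoded columns are identical between training and scoring, even if a category is absent from one batch.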

Update R model to use tdplyr

Once the library is available in the base images (see the related issue), update the R demo to also use Teradata for the data. This should follow the same pattern as the Python models.

R base dependencies incorrectly installed

The base image we are using for R, willfleury/r_base:3.4, does not have the dependencies installed correctly. This means everything is installed during demos, which is painfully slow for R libraries like gbm.

Minor fix required to the base image Dockerfile

Add data_stats.json to the BYOM PIMA folder

Since the new evaluation.py records data_stats, imported (BYOM) models need to include this file.

from evaluation.py
stats.record_evaluation_stats(DataFrame(data_conf["table"]), DataFrame(predictions_table))

Fix Vantage model with .2 release of teradataml

With the .2+ release of teradataml, it no longer wraps mixed-case column names in quotes when it passes them to the SQL engine. For example, it used to be

NumericInputs('"NumTimesPrg"')

but now is

NumericInputs('NumTimesPrg')

This means that if your table has mixed-case column names, the engine won't find them when it goes to score. To solve this, we should do what is best practice anyway and avoid mixed-case column names.
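One simple way to follow that best practice is to normalize column names before the data ever reaches Vantage, so the generated SQL never depends on quoted identifiers. A minimal pandas sketch:

```python
import pandas as pd

# Mixed-case PIMA column names as used in the demo model.
df = pd.DataFrame({"NumTimesPrg": [6, 1], "PlGlcConc": [148, 85]})

# Lower-case every column before writing the table to Vantage.
df.columns = [c.lower() for c in df.columns]
```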

Improve pyspark model repository support

pyspark models won't follow the same pattern as simple python models, in that they cannot store the MLlib model in the local model folder of the driver. By its nature it needs distributed storage, usually HDFS and sometimes S3 or similar.

Depending on the model artefact store chosen, the framework may need to tell the spark job where to temporarily store the model files on HDFS (or similar) when training finishes, so it can later upload them to the correct storage. A similar effort is required for scoring, as we need to make the model files available somewhere for spark to read.

The approach we are working with so far is for the framework to add a property to the spark submit which allows the base path to be configured per project via the project.metadata; the framework will always append the relevant model version (trainedModel.id) to avoid conflicts.

spark.conf.get("spark.aoa.modelPath")

Change code in evaluation.py for BYOM

We should record stats only if the data_stats file is available. Since this file is not mandatory in the UI, users can get errors during the evaluation process. Change to the BYOM evaluation.py:

import os

# only record evaluation stats when data_stats.json was packaged with the model
if os.path.exists("data_stats.json"):
    stats.record_evaluation_stats(DataFrame(data_conf["table"]), DataFrame(predictions_table))

Add Demo App Portal for Diabetes Check

To make the demos better, we should add a sample website for the PIMA diabetes RESTful model serving that calls it and scores a patient to see if they have diabetes.
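A hypothetical sketch of the request such a portal would send to the model-serving endpoint. The payload layout and helper are illustrative only, not the real serving API; only the feature names come from the PIMA dataset used in the demos.

```python
import json

# PIMA feature names as used in the demo model; order matters for scoring.
PIMA_FEATURES = ["NumTimesPrg", "PlGlcConc", "BloodP", "SkinThick",
                 "TwoHourSerIns", "BMI", "DiPedFunc", "Age"]

def build_score_request(patient):
    """Serialize one patient's feature values for a scoring call.

    `patient` is a dict mapping feature name to value; the return value is
    the JSON body the portal would POST to the serving endpoint.
    """
    return json.dumps({"data": [[patient[f] for f in PIMA_FEATURES]]})
```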

Need to remove __init__.py

Need to remove the __init__.py file from AoaDemoModels/model_templates/, as it causes the repo-cli.py script to fail.

Update PIMA Diabetes XGBoost Demo Model to use S3 for datasets

Currently the XGBoost demo model for diabetes, which we use in almost all our demos, reads the dataset from the internet via a URL. Pandas now also supports reading from S3 via pd.read_csv. This would be really powerful: since we include an embedded S3 via S3Proxy (and also minio with docker compose), we should be able to include the necessary files as part of the startup process for the demo containers.

For example, we should have a demo-data bucket which has the diabetes training and evaluation datasets in there (already split). We should also be able to write the results of scoring back into this under a different folder so we can show e.g. the results of airflow execution!
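A sketch of reading the split datasets from such a bucket. The bucket and file names are assumptions; pd.read_csv handles s3:// URLs when s3fs is installed, and storage_options can point it at the embedded minio/S3Proxy endpoint instead of real AWS S3.

```python
import pandas as pd

def load_dataset(name, bucket="demo-data", endpoint=None):
    """Read e.g. 'pima_train.csv' or 'pima_eval.csv' from the datasets bucket.

    Pass `endpoint` (e.g. "http://minio:9000") when targeting the embedded
    S3-compatible store rather than AWS.
    """
    options = {"client_kwargs": {"endpoint_url": endpoint}} if endpoint else None
    return pd.read_csv(f"s3://{bucket}/{name}", storage_options=options)
```

Scoring results could be written back with the matching DataFrame.to_csv("s3://…", storage_options=…) call to a different folder in the same bucket.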

@nunomachado fyi this is what i was talking about - no rush on it - just an fyi

Cleanup Demo Models

  • Move MLE models to separate project/repo
  • Update Demo Models to use Vantage
  • Update Tensorflow Example

Add local explainability implementation

Add local explainability example for Python XGBoost using the following package

https://pypi.org/project/alibi/

Unlike shap, this allows producing the explainer once (at training time) and then saving (serializing) it, so it can be loaded later and used to explain individual predictions.

Add the explain endpoint in the scoring.py->ModelScorer.explain()
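A sketch of the train-once/serialize pattern described above, degrading to a no-op when alibi is not installed. The artifact path, predict_fn wiring, and use of pickle are assumptions; newer alibi versions also ship their own save/load helpers, which may be preferable for some explainers.

```python
import pickle

try:
    from alibi.explainers import AnchorTabular
    HAVE_ALIBI = True
except ImportError:
    HAVE_ALIBI = False

def train_explainer(predict_fn, X_train, feature_names, path="explainer.pkl"):
    """Fit the explainer once at training time and serialize it so a
    scoring-time explain() method can load it for individual predictions."""
    if not HAVE_ALIBI:
        return None
    explainer = AnchorTabular(predict_fn, feature_names)
    explainer.fit(X_train)
    with open(path, "wb") as f:
        pickle.dump(explainer, f)
    return explainer
```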
