Giter VIP home page Giter VIP logo

mlops_template_bqml_e2e's Introduction

Dummy BQML system E2E implementation

This is a sample project that illustrates how to use Vertex AI, BQML and Cloud Build to build a fully functional ML system comprising of the following functionalities:

  • Training Pipeline: implemented using Vertex AI Pipeline, we showcase a simple pipeline that executes BQML queries to train a model.
  • Batch scoring Pipeline: similar to the training pipeline, also this pipeline is implemented using Vertex AI Pipeline. It will perform predictions for the BQML model.
  • CICD Pipeline: implemented with Cloud Build, this pipeline will execute dummy unit tests, compile the pipelines, test them and then simulate a deployment to production with schedule enabled.

As part of this repository we guide the user in setting up the system in a matter of minutes. We will show how to execute training, scoring and how to setup CICD for both pipelines, by creating a dummy repository to allow for automatic trigger.

The system implemented in this repository is a dummy fraud detection model, implemented using the same dataset of FraudFinder

Supercharging BQML with Jinja

A key feature of the BQML queries defined in this repository is that they support Jinja2 templating. This extends the capabilities of SQL by introducing scripting, allowing developers to define complex logic in the SQL such as:

An example of that is the model name and model parameters of the SQL file which creates the model in fraud_detector/training/sql/01_train_model.sql:

CREATE OR REPLACE MODEL `{{ model_name }}`
        OPTIONS (
            MODEL_TYPE='LOGISTIC_REG',
            INPUT_LABEL_COLS=['tx_fraud'],
            EARLY_STOP=TRUE,
            MAX_ITERATIONS={{ model_parameters['max_iterations'] }},
            model_registry='vertex_ai',
            vertex_ai_model_id='fraud_detector',
            vertex_ai_model_version_aliases=['experimental','{{ now().strftime('%Y_%m_%d') }}']
        )
        AS SELECT * FROM tx.feature_table

You can notice that the name of the model or the variable for MAX_ITERATIONS are parametrised in the SQL query. This allows the user to control these parameters directly in the configuration file for the training pipeline to reduce duplication of logic.

pipeline_name: dummy-predictor-training
experiment_name: dummy-predictor-exp
pipeline_schedule: "1 * * * *"
timezone: "Europe/Berlin"

labels:
  team: team_name

parameters:
  model_name: "tx.fraud_detector"
  model_parameters:
    max_iterations: 20

Jinja templating is made possible by creating a custom KFP Component which executes SQL queries. You can find the definition of the component here

GCP environment setup

bash utils/scripts/setup.sh

Setup local environment

Setup environment:

virtualenv venv && source venv/bin/activate && pip install -r requirements.txt

Import training data

Run the following script to download the data in BQ:

python utils/scripts/copy_bigquery_data.py --project $PROJECT_ID --location  "us-central1"

Pipelines

Trigger training pipeline

make training-pipeline

Trigger batch scoring pipeline

make batch-scoring-pipeline

CICD

Run Cloud build pipeline for training (training pipeline is default in Cloud Build)

gcloud builds submit --config build/test_and_deploy_pipeline.yaml

Run Cloud build pipeline for batch scoring

gcloud builds submit --config build/test_and_deploy_pipeline.yaml --substitutions="_PIPELINE_FOLDER=fraud_detector/batch_scoring"

Create a repository

We create a repository in Cloud Source repository to showcase how a commit can trigger Cloud Build

REPOSITORY_NAME=fraud_detector
gcloud source repos create $REPOSITORY_NAME --project=$PROJECT_ID

git init && git config --global credential.https://source.developers.google.com.helper gcloud.sh
git remote add origin https://source.developers.google.com/p/$PROJECT_ID/r/$REPOSITORY_NAME

Please note that we copy the file for

You will need now to setup 2 triggers, for the 2 vertex pipelines you support. You would like to achieve the following behaviour:

  • Every time a file under fraud_detector/training changes, you want to trigger a new deployment of CloudBuild for the batch scoring pipeline:
gcloud beta builds triggers create cloud-source-repositories \
    --name=fd-training \
    --repo=$REPOSITORY_NAME \
    --branch-pattern=main \
    --build-config=build/test_and_deploy_pipeline.yaml \
    --substitutions="_PIPELINE_FOLDER=fraud_detector/training" \
    --included-files="fraud_detector/training/*"
  • Every time a file under fraud_detector/batch_scoring changes, you want to trigger a new deployment of CloudBuild for the batch scoring pipeline:
gcloud beta builds triggers create cloud-source-repositories \
    --name=fd-batch-scoring \
    --repo=$REPOSITORY_NAME \
    --branch-pattern=main \
    --build-config=build/test_and_deploy_pipeline.yaml \
    --substitutions="_PIPELINE_FOLDER=fraud_detector/batch_scoring" \
    --included-files="fraud_detector/batch_scoring/*"

Make your first commit

cd $REPOSITORY_NAME && git add .
git config --global user.email "your_email"
git config --global user.name "your_email"
git commit -a -m "initial commit"
git push --set-upstream origin main

mlops_template_bqml_e2e's People

Stargazers

 avatar  avatar Julien Miquel avatar

Watchers

Julien Miquel avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.