
bigquery-ml-utils's Introduction

BigQuery ML Utils

BigQuery ML (BQML) lets you create and run machine learning models in BigQuery using standard SQL queries. The BigQuery ML Utils library is an integrated suite of machine-learning tools for building and using BigQuery ML models.

Installation

Install this library in a virtualenv using pip. virtualenv is a tool to create isolated Python environments. The basic problem it addresses is one of dependencies and versions, and indirectly permissions.

With virtualenv, it's possible to install this library without needing system install permissions, and without clashing with the installed system dependencies.

Mac/Linux

    pip install virtualenv
    virtualenv <your-env>
    source <your-env>/bin/activate
    <your-env>/bin/pip install bigquery-ml-utils

Windows

    pip install virtualenv
    virtualenv <your-env>
    <your-env>\Scripts\activate
    <your-env>\Scripts\pip.exe install bigquery-ml-utils

Overview

Inference

Transform Predictor

The Transform Predictor feeds input data into the BQML model trained with TRANSFORM. It performs both preprocessing and postprocessing on the input and output. The first argument is a SavedModel which represents the TRANSFORM clause for feature preprocessing. The second argument is a SavedModel or XGBoost Booster which represents the model logic.
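The two-stage flow described above can be sketched in plain Python. The callables below are stand-ins for the real SavedModel and Booster; none of these names come from the bigquery-ml-utils API:

```python
# Minimal sketch of the TRANSFORM + model two-stage prediction flow.
# The function names here are illustrative stand-ins, not library API.

def transform(raw_rows):
    # Stage 1: feature preprocessing, as the TRANSFORM SavedModel would do.
    return [[row["x"] * 2.0] for row in raw_rows]

def model(features):
    # Stage 2: model logic (a SavedModel or XGBoost Booster in the library).
    return [feats[0] + 1.0 for feats in features]

def predict(raw_rows):
    # TransformPredictor-style composition: preprocess, then predict.
    return model(transform(raw_rows))

print(predict([{"x": 1.0}, {"x": 2.0}]))  # [3.0, 5.0]
```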

XGBoost Predictor

The XGBoost Predictor feeds input data into the BQML XGBoost model. It performs both preprocessing and postprocessing on the input and output. The first argument is an XGBoost Booster which represents the model logic. The following arguments are model assets.

Tensorflow Ops

BQML TensorFlow Custom Ops provides SQL functions (Date functions, Datetime functions, Time functions and Timestamp functions) that are not available in TensorFlow. The implementation and function behavior align with BigQuery. This is part of an effort to bridge the gap between the SQL community and the TensorFlow community. The following example returns the same result as TIMESTAMP_ADD(timestamp_expression, INTERVAL int64_expression date_part):

>>> timestamp = tf.constant(['2008-12-25 15:30:00+00', '2023-11-11 14:30:00+00'], dtype=tf.string)
>>> interval = tf.constant([200, 300], dtype=tf.int64)
>>> result = timestamp_ops.timestamp_add(timestamp, interval, 'MINUTE')
>>> result
tf.Tensor([b'2008-12-25 18:50:00.0 +0000' b'2023-11-11 19:30:00.0 +0000'], shape=(2,), dtype=string)
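For intuition, the same arithmetic can be reproduced with Python's standard datetime module. This sketch only mirrors the op's semantics (adding N minutes to a UTC timestamp); it is not how the custom op is implemented, and the output format differs slightly:

```python
from datetime import datetime, timedelta

def timestamp_add_minutes(ts: str, minutes: int) -> str:
    # The BigQuery-style timestamps above end in '+00'; pad the offset
    # to '+0000' so strptime's %z directive can parse it.
    dt = datetime.strptime(ts + "00", "%Y-%m-%d %H:%M:%S%z")
    # Mirrors TIMESTAMP_ADD(ts, INTERVAL minutes MINUTE).
    return (dt + timedelta(minutes=minutes)).strftime("%Y-%m-%d %H:%M:%S%z")

print(timestamp_add_minutes("2008-12-25 15:30:00+00", 200))  # 2008-12-25 18:50:00+0000
print(timestamp_add_minutes("2023-11-11 14:30:00+00", 300))  # 2023-11-11 19:30:00+0000
```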

Note: /usr/share/zoneinfo is needed for time-zone parsing and might not be available in your OS. You will need to install tzdata to generate it. For example, add the following to your Dockerfile:

RUN apt-get update && DEBIAN_FRONTEND="noninteractive" \
    TZ="America/Los_Angeles" apt-get install -y tzdata

Model Generator

Text Embedding Model Generator

The Text Embedding Model Generator automatically loads a text embedding model from TensorFlow Hub and integrates a signature such that the resulting model can be immediately integrated within BQML. Currently, the NNLM, SWIVEL, and BERT embedding models can be selected.

NNLM Text Embedding Model

The NNLM model has a model size of <150MB and is recommended for phrases, news, tweets, reviews, etc. NNLM does not carry any default signatures because it is designed to be utilized as a Keras layer; however, the Text Embedding Model Generator takes care of this.

SWIVEL Text Embedding Model

The SWIVEL model has a model size of <150MB and is recommended for phrases, news, tweets, reviews, etc. SWIVEL does not require pre-processing because the embedding model already satisfies BQML imported model requirements. However, in order to align signatures for NNLM, SWIVEL, and BERT, the Text Embedding Model Generator establishes the same input label for SWIVEL.

BERT Text Embedding Model

The BERT model has a model size of ~200MB and is recommended for phrases, news, tweets, reviews, paragraphs, etc. The BERT model does not carry any default signatures because it is designed to be utilized as a Keras layer. The Text Embedding Model Generator takes care of this and also integrates a text preprocessing layer for BERT.
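Conceptually, the generator wraps each hub model so that all three expose the same serving signature. The stub sketch below illustrates that idea in plain Python; the input key 'content' and all function names are illustrative assumptions, not the library's actual signature:

```python
# Stub embedders stand in for NNLM / SWIVEL / BERT hub models.
# The common input key 'content' is an illustrative assumption.

def nnlm_embed(texts):
    # Stand-in for a Keras-layer model with no default signature.
    return [[float(len(t))] for t in texts]

def swivel_embed(texts):
    # Stand-in for a model that is already servable as-is.
    return [[float(len(t)) * 2.0] for t in texts]

def with_common_signature(embed_fn):
    # Wrap an embedder so every model accepts the same dict-shaped input
    # and returns the same dict-shaped output.
    def serving_fn(inputs):
        return {"embedding": embed_fn(inputs["content"])}
    return serving_fn

nnlm = with_common_signature(nnlm_embed)
swivel = with_common_signature(swivel_embed)
print(nnlm({"content": ["hello"]}))    # {'embedding': [[5.0]]}
print(swivel({"content": ["hello"]}))  # {'embedding': [[10.0]]}
```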

bigquery-ml-utils's People

Contributors

faizan-m, junyazhang, liujiashang, xiaoqiuh, yinguangzhao


bigquery-ml-utils's Issues

IMPORTANT: incorrect estimates from logistic regression (especially when EARLY_STOP = TRUE)

I'm not sure this is the correct place to report this issue.
Please forward it to the right people if necessary.

(1) Summary:

  • the estimates of logistic regression in BQML are wrong (i.e., the values are not close to the maximum likelihood estimates)
  • this is caused by early_stop combined with batch_gradient_descent

(2) Suggestion:

  • set the default value of EARLY_STOP to FALSE
  • remove the feature that computes the p-value
  • implement the Newton-Raphson method (second-order convergence) and apply it

(both, especially when CALCULATE_P_VALUES = TRUE)

I understand the above suggestions may not suit the design strategy of BQ"ML" (it's ML, not statistics).
So, there may be better solutions.

(3) Numerical Example:

The following queries train on the same data, so the estimates should be identical.
In this example, the true coefficient of x is ln 3 = 1.0986...
However, the coefficient from query 1 (with early stop) is 1.14 (wrong),
while the coefficient from query 2 (without early stop) is 1.0986... (correct!)

data

[screenshot: input data]

result

result of query 1 (with early stop)
[screenshot]

result of query 2 (withOUT early stop)
[screenshot]

queries

Query1:

  create or replace model `project.dataset.model1`
  options (
    input_label_cols         = ['y'],
    model_type               = 'logistic_reg',
    data_split_method        = 'no_split',
    max_iterations           = 15,
    l1_reg                   = 0,
    l2_reg                   = 0
  )
  as 

 with arrays as (
  select
    array[-1, -1, -1, -1, 1, 1, 1, 1] as x,
    array[0, 0, 0, 1, 0, 1, 1, 1] as y
  )
  select
    x,
    y
  from arrays,
  unnest(x) as x with offset as index_x
  join unnest(y) as y with offset as index_y
  on index_x = index_y

Query2:

  create or replace model `project.dataset.model2`
  options (
    input_label_cols         = ['y'],
    model_type               = 'logistic_reg',
    data_split_method        = 'no_split',
    max_iterations           = 15,
    l1_reg                   = 0,
    l2_reg                   = 0,
    optimize_strategy        = 'batch_gradient_descent',
    early_stop               = false  # DIFFERENCE HERE
  )
  as 

 with arrays as (
  select
    array[-1, -1, -1, -1, 1, 1, 1, 1] as x,
    array[0, 0, 0, 1, 0, 1, 1, 1] as y
  )
  select
    x,
    y
  from arrays,
  unnest(x) as x with offset as index_x
  join unnest(y) as y with offset as index_y
  on index_x = index_y
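The claimed value can be checked independently of BQML: a few Newton-Raphson (IRLS) iterations on this 8-point dataset recover the intercept 0 and slope ln 3 ≈ 1.0986 exactly, which is the approach suggested in (2) above. A standalone sketch:

```python
import math

# The 8 (x, y) points from the queries above.
x = [-1, -1, -1, -1, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = 0.0, 0.0
for _ in range(10):  # Newton-Raphson iterations on the log-likelihood
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        g0 += yi - p              # gradient w.r.t. intercept
        g1 += (yi - p) * xi       # gradient w.r.t. slope
        w = p * (1.0 - p)         # IRLS weight
        h00 += w                  # (negative) Hessian entries
        h01 += w * xi
        h11 += w * xi * xi
    det = h00 * h11 - h01 * h01
    # Newton step: beta += H^{-1} g (2x2 inverse written out).
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

print(round(b1, 4))  # 1.0986, i.e. ln(3)
```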

(4) Practical example

I can't share the original data because it is highly confidential, so I describe the situation abstractly.

The numerical difference is as follows:
[screenshot: coefficient comparison]

These are the first few coefficients of a model with 100+ variables and 100k+ rows (from my job).

If we take the statsmodels estimates as ground truth, BQML does not produce correct estimates.

Additional information 1: about this model

  • var_1 through var_3 are one-hot variables derived from one categorical variable.
  • for most rows, var_1 + var_2 + var_3 = 1, which could suggest multicollinearity (but max(VIF) = 5, so multicollinearity is not the issue here)

Additional information 2: loss curve and estimation algorithm

The loss curve of this estimation is shown below.
As "ML," early stopping is natural here (no further loss improvement),
but as "statistics" it is problematic, since the estimated values are far from correct (thus the p-values are misleading).

[screenshot: loss curve]

Support non-TRANSFORM model Predictors

Would it be possible to provide a TensorFlow predictor that expects the model to be queried with the PREDICT command rather than the TRANSFORM command? This library is great for locally using models exported from BQML, but I currently have to call the Predictor._get_transform_result() function manually to bypass the two-stage TRANSFORM+PREDICT logic.

Thanks!
