
liga's People

Contributors

bobingm, changhiskhan, chunyang, da-liii, da-tubi, eddyxu, ffcai, gitter-badger, renkai, smellslikeml, xujie8410


liga's Issues

`ML_TRANSFORM` and New flavor: Spark ML

Conclusion

The new Spark ML flavor requires ML_TRANSFORM, not ML_PREDICT.

ML_TRANSFORM and ML_PREDICT differ in the size of the model they target.

ML_PREDICT is implemented with a PySpark pandas_udf and works well for small models that can be loaded on a single node. ML_TRANSFORM is for big models that cannot be loaded on a single node.
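
The contrast can be sketched without Spark. The snippet below is a minimal sketch, not Liga's implementation: it mimics the iterator style of a pandas UDF, in which each worker loads the whole model once and then scores batches. `predict_batches`, `load_threshold_model` and the toy threshold model are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable, Iterator, List

Batch = List[float]
Model = Callable[[Batch], List[int]]


def predict_batches(
    load_model: Callable[[], Model], batches: Iterable[Batch]
) -> Iterator[List[int]]:
    """Score batches the way a pandas-UDF-style worker would (hypothetical helper)."""
    # The whole model must fit in this single worker's memory,
    # which is why this pattern suits small models only.
    model = load_model()
    for batch in batches:
        yield model(batch)


def load_threshold_model() -> Model:
    # Toy stand-in for a small model: label 1 when the value exceeds 0.5.
    return lambda batch: [int(x > 0.5) for x in batch]


results = list(predict_batches(load_threshold_model, [[0.1, 0.9], [0.7]]))
print(results)
```

A model that cannot be loaded by `load_model` on one node has no place in this pattern, which is the gap ML_TRANSFORM is meant to cover.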

Two Previous Attempts by Renkai

eto-ai/rikai#338
I tried to implement ML_PREDICT for SparkML the way @da-tubi did in eto-ai/rikai#326, but it is much more complex than I thought. The best way to complete it may be to implement the ML_PREDICT UDF for SparkML in Scala, so that the worker does not need a SparkContext to get a properly set up JVM.

However, that is independent of this issue: we can still implement the training-by-SparkML feature; we just cannot use ML_PREDICT for SparkML.


eto-ai/rikai#343
Another attempt to implement ML_PREDICT for SparkML failed, even though I tried it in Scala this time. The key issue is that a SparkML model can only operate on a Dataset, which is not available inside a UDF. ML_PREDICT would have to be replaced by a driver-side code generator, not just by another UDF.

Let us try again

Demo Code: RandomForestClassifier

https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labelsArray(0))

// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")

val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
println(s"Learned classification forest model:\n ${rfModel.toDebugString}")

MLflow API

Apply SpectralClustering on datasets

Problem

https://scikit-learn.org/1.1/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering

SpectralClustering has neither a predict nor a transform method, and no model will be logged to MLflow.

How can we apply this kind of model to a dataset?

Here is the code snippet to use SpectralClustering:

>>> from sklearn.cluster import SpectralClustering
>>> import numpy as np
>>> X = np.array([[1, 1], [2, 1], [1, 0],
...               [4, 7], [3, 5], [3, 6]])
>>> clustering = SpectralClustering(n_clusters=2,
...         assign_labels='discretize',
...         random_state=0).fit(X)
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])
>>> clustering
SpectralClustering(assign_labels='discretize', n_clusters=2,
    random_state=0)

Analysis

Case 1: small dataset

Take sklearn's SpectralClustering as an example: it is not a distributed ML model.

It should not be applied to large-scale data that cannot be loaded into one executor.
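
For the small-dataset case, a fit-only estimator can still be applied by fitting on the whole batch and reading `labels_`, as in the snippet above. The sketch below is self-contained: `GapClusterer` is a toy, hypothetical stand-in for an estimator such as SpectralClustering, and `fit_and_label` is a hypothetical helper, not part of Liga.

```python
from typing import List, Protocol, Sequence


class FitOnlyEstimator(Protocol):
    labels_: List[int]

    def fit(self, X: Sequence[float]) -> "FitOnlyEstimator": ...


class GapClusterer:
    """Toy 1-D clusterer: split the data at the largest gap between sorted values."""

    def fit(self, X: Sequence[float]) -> "GapClusterer":
        ordered = sorted(X)
        gaps = [b - a for a, b in zip(ordered, ordered[1:])]
        widest = gaps.index(max(gaps))
        cut = (ordered[widest] + ordered[widest + 1]) / 2
        self.labels_ = [int(x > cut) for x in X]
        return self


def fit_and_label(estimator: FitOnlyEstimator, X: Sequence[float]) -> List[int]:
    # There is no separate predict step, so fitting and labeling happen together
    # and the whole dataset has to be available in one place.
    return estimator.fit(X).labels_


labels = fit_and_label(GapClusterer(), [1.0, 2.0, 1.5, 9.0, 8.0, 8.5])
print(labels)
```

Because labels only exist for the rows seen during `fit`, this approach does not extend to data that is split across executors.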

Case 2: big dataset

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.PowerIterationClustering.html#pyspark.ml.clustering.PowerIterationClustering

Liga README

Liga: the ML-Enhanced Spark SQL

Design

Liga is a general-purpose ML-enhanced SQL framework designed to be modular, extensible and scalable.

Spark SQL and MLflow
Currently, Liga depends on Spark SQL. This does not mean Liga will always be an Apache Spark based project. Just as the MLflow registry is an optional registry in Liga, Spark SQL should and could also become optional.
For Prediction but not for Training
Thanks to ML model registries, training and prediction can be separated. Liga is designed to apply ML models to datasets via SQL syntax; for now, it is not designed to train ML models via SQL syntax. Let us focus on applying and serving ML models first!
General Purpose but not Domain Specific
Integration with a specific model, a specific ML framework or a specific domain such as vision or audio should not be maintained in this repository. Such integrations should live in separate projects that depend on Liga, e.g. a separate project `liga-vision` for the computer-vision UDFs and UDTs (e.g. Image, Box2d), or a separate project `liga-pytorch` for the PyTorch framework. The scikit-learn and Spark ML integrations are the reference implementations, which is why they are maintained in Liga.

Live Notebooks

Latest Notebook
The Latest Notebook is the in-repo Jupyter notebook; it depends on the latest code in this repo. Clone this repo and launch the Jupyter notebooks locally, or click the link below to preview it on GitHub.
Google Colab Notebook
The Google Colab Notebook depends on the latest stable release of Liga. The link below opens the notebook in Google Colab against that release, so you can try the live notebooks with nothing but a web browser.

Try the latest notebooks:

git clone https://github.com/liga-ai/liga.git
cd liga
bin/lab sklearn

Preview the notebooks on GitHub or try the live notebooks on Google Colab:

| Model | Model Type | Official Documentation | Preview Latest Notebook | Try Google Colab Notebook |
|---|---|---|---|---|
| LinearRegression | regressor | Linear Models, Ordinary Least Squares | Demo | |
| LogisticRegression | classifier | Logistic regression | Demo | |
| Ridge | regressor | Ridge regression and classification | Demo | |
| RidgeClassifier | classifier | | Demo | |
| SVC/NuSVC/LinearSVC | classifier | Support Vector Machines | Demo | |
| SVR/NuSVR/LinearSVR | regressor | | Demo | |
| RandomForestClassifier | classifier | Forests of randomized trees | Demo | |
| RandomForestRegressor | regressor | | Demo | |
| ExtraTreesClassifier | classifier | | Demo | |
| ExtraTreesRegressor | regressor | | Demo | |
| KMeans | cluster | Clustering | Demo | |
| PCA | transformer | Decomposing signals in components (matrix factorization problems) | Demo | |

Liga SQL References

SQL: ML_PREDICT for small models

SELECT
  id,
  ML_PREDICT(my_yolov5, image)
FROM cocodataset 

ML_PREDICT is a special UDF which takes two parameters:

  • model_name is a special parameter that looks like an identifier
  • data
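
To make the two parameters concrete, the sketch below splits an ML_PREDICT call into its model name and data expression with a regular expression. It is only an illustration under the assumption that the model name is a bare identifier and the data expression contains no nested parentheses. Liga's actual SQL parser is not shown in this document, and `parse_ml_predict` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Model name: a bare identifier. Data: the text up to the first closing parenthesis.
_ML_PREDICT = re.compile(
    r"ML_PREDICT\(\s*([A-Za-z_][A-Za-z0-9_]*)\s*,\s*([^)]+?)\s*\)",
    re.IGNORECASE,
)


def parse_ml_predict(sql: str) -> Optional[Tuple[str, str]]:
    """Return (model_name, data_expression) for the first ML_PREDICT call, if any."""
    match = _ML_PREDICT.search(sql)
    if match is None:
        return None
    return match.group(1), match.group(2)


query = "SELECT id, ML_PREDICT(my_yolov5, image) FROM cocodataset"
print(parse_ml_predict(query))
```

Note that a quoted string in the first position, such as `'my_yolov5'`, does not match: in this sketch the model name is an identifier, not a string literal.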

SQL: ML_TRANSFORM for big models

TODO (see #9)

SQL: Model Creation

A model instance is created from a URI by specifying the model flavor, model type and options.

-- Create model
CREATE [OR REPLACE] MODEL model_name
[FLAVOR flavor]
[MODEL_TYPE model_type]
[OPTIONS (key1=value1,key2=value2,...)]
USING "uri";
  • flavor: eg. liga.sklearn => from liga.sklearn.codegen import codegen
  • model_type: eg. classifier => from liga.sklearn.models.classifier import MODEL_TYPE
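
The mapping from FLAVOR and MODEL_TYPE to Python modules in the two bullets above can be written out as a small function. This is a sketch of the naming convention only; `resolve_modules` is a hypothetical helper and does not import anything.

```python
from typing import Tuple


def resolve_modules(flavor: str, model_type: str) -> Tuple[str, str]:
    """Map FLAVOR and MODEL_TYPE to module paths (hypothetical helper)."""
    codegen_module = f"{flavor}.codegen"                 # expected to expose `codegen`
    model_type_module = f"{flavor}.models.{model_type}"  # expected to expose `MODEL_TYPE`
    return codegen_module, model_type_module


print(resolve_modules("liga.sklearn", "classifier"))
```

With the corresponding package installed, the returned paths could then be loaded with `importlib.import_module`.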

SQL: Model Catalog

-- Describe model
{ DESC | DESCRIBE } MODEL model_name;

-- Show all models
SHOW MODELS;

-- Delete a model
DROP MODEL model_name;
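
The catalog statements above map naturally onto a dictionary keyed by model name. The following is a minimal sketch of such an in-memory catalog, not Liga's actual class: `ModelSpec`, `InMemoryCatalog` and the error raised on duplicate names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelSpec:
    name: str
    uri: str
    flavor: Optional[str] = None
    model_type: Optional[str] = None
    options: Dict[str, str] = field(default_factory=dict)


class InMemoryCatalog:
    """Hypothetical in-memory catalog; not Liga's actual implementation."""

    def __init__(self) -> None:
        self._models: Dict[str, ModelSpec] = {}

    def create(self, spec: ModelSpec, or_replace: bool = False) -> None:
        # CREATE [OR REPLACE] MODEL
        if spec.name in self._models and not or_replace:
            raise ValueError(f"model {spec.name!r} already exists")
        self._models[spec.name] = spec

    def describe(self, name: str) -> ModelSpec:
        # { DESC | DESCRIBE } MODEL
        return self._models[name]

    def show(self) -> List[str]:
        # SHOW MODELS
        return sorted(self._models)

    def drop(self, name: str) -> None:
        # DROP MODEL
        del self._models[name]


catalog = InMemoryCatalog()
catalog.create(ModelSpec("my_yolov5", "mlflow:///yolov5"))
print(catalog.show())
catalog.drop("my_yolov5")
print(catalog.show())
```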

Python API

Model Type

A Model Type encapsulates the interface and schema of a concrete ML model. It acts as an adaptor between the raw ML model input/output tensors and Spark / Pandas.

Here is the key code snippet of the sklearn classifier model type (liga.sklearn.models.classifier):

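from typing import Any, List

# SklearnModelType is the base class from Liga's sklearn integration;
# its import is omitted in this key snippet.


class Classifier(SklearnModelType):
    """Classification model type"""

    def schema(self) -> str:
        # Type of the prediction column returned to Spark
        return "int"

    def predict(self, *args: Any, **kwargs: Any) -> List[int]:
        assert self.model is not None
        assert len(args) == 1
        # Delegate to the wrapped sklearn model and convert to a plain list
        return self.model.predict(args[0]).tolist()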
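
The adaptor role can be exercised without scikit-learn. In the sketch below, `ModelTypeStub` is a hypothetical stand-in for `SklearnModelType` (whose definition is not shown here, so the constructor is an assumption) and `ParityModel` is a toy model. Only `schema` and `predict` follow the snippet above.

```python
from array import array
from typing import Any, List


class ModelTypeStub:
    """Hypothetical stand-in for SklearnModelType: it only stores the model."""

    def __init__(self, model: Any = None) -> None:
        self.model = model


class ParityModel:
    """Toy model: predicts 1 for odd inputs and 0 for even ones."""

    def predict(self, values: List[int]) -> array:
        # array.array offers tolist(), like the NumPy arrays sklearn returns.
        return array("i", [v % 2 for v in values])


class Classifier(ModelTypeStub):
    def schema(self) -> str:
        return "int"

    def predict(self, *args: Any, **kwargs: Any) -> List[int]:
        assert self.model is not None
        assert len(args) == 1
        return self.model.predict(args[0]).tolist()


classifier = Classifier(ParityModel())
print(classifier.schema(), classifier.predict([1, 2, 3]))
```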

Model Flavor

A Flavor describes the framework upon which the model was built.

A Liga model flavor should provide:

generate_udf
Constructs a Pandas UDF that runs flavor-specific models. The special UDF `ML_PREDICT` is translated into the Pandas UDF generated for each flavor.
load_model_from_uri
Loads a model from a filesystem URI for `FileSystemRegistry`. It is needed because each ML framework loads a model from a filesystem URI in its own way. Model registries like MLflow unify how a model is loaded from the registry, which is why a URI (e.g. `mlflow:///yolov5`) is sufficient for them.

Supported flavors:

  • sklearn (provided by liga-sklearn)
  • pytorch (provided by liga-pytorch)
  • ...

Model Registry

A model registry specifies where and how to load a model.

| Name | Pypi | URI |
|---|---|---|
| DummyRegistry | liga | A special registry without URI provided. How and where to load model is hard-coded in model types, eg. torchvision.models.resnet50(). |
| FileSystemRegistry | liga | http:///,file:///,s3:///,... |
| MLflowRegistry | liga-mlflow | mlflow:/// MLflowRegistry is the recommended production-ready model registry. |
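
The table suggests that a registry can be chosen from the URI scheme alone. A minimal sketch of that dispatch follows. `registry_for` is a hypothetical helper that returns registry names as strings, and the scheme set covers only the three schemes spelled out in the table.

```python
from typing import Optional
from urllib.parse import urlparse

# Only the schemes named in the table; the table's "..." implies there are more.
FILESYSTEM_SCHEMES = {"http", "file", "s3"}


def registry_for(uri: Optional[str]) -> str:
    """Pick a registry name from a model URI (hypothetical helper)."""
    if not uri:
        return "DummyRegistry"  # no URI provided
    scheme = urlparse(uri).scheme
    if scheme == "mlflow":
        return "MLflowRegistry"
    if scheme in FILESYSTEM_SCHEMES:
        return "FileSystemRegistry"
    raise ValueError(f"no registry known for scheme {scheme!r}")


print(registry_for("mlflow:///yolov5"))
```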

Model Catalog

Currently, only an in-memory model catalog is available in Liga. With the Model Catalog, ML-enhanced SQL users only need to focus on how to apply ML-enhanced SQL to datasets at scale, while the models are carefully maintained by Data/ML Engineers or Data Scientists.

WARNING: A Python API to customize the Model Catalog is not yet provided!

History

Liga is the ML-enhanced SQL part of Rikai. Rikai was created by @changhiskhan and @eddyxu, and its first release dates back to 2021/04/04. @da-tubi and @Renkai created the Liga fork of Rikai as a project of the 4th Tubi Hackathon (#4).

Liga: the ML-enhanced SQL part of Rikai (The Hackathon)

Hackathon

by @da-tubi and @Renkai

Here is the result of the hackathon:

Liga vs. Rikai

Liga vs. Rikai (Linguistic)

The pronunciation of Liga in Wuu Chinese and of Rikai in Japanese is almost the same.
Both Liga and Rikai mean "understanding" in English, or 理解 in Chinese.

Liga vs. Rikai (Demo Notebook)

Liga vs. Rikai (Software Engineering)

Rikai is too complicated to maintain and is dedicated to computer vision. Liga is designed to modularize Rikai; we only need the ML-enhanced SQL part of Rikai:

  • the ML_PREDICT magic syntax for Spark SQL (also ML_TRANSFORM, ML_FORECAST)
  • MLflow integration
  • ModelType design
- Rikai: ML_PREDICT/ModelType/MLflow/PyTorch/Computer Vision/Rikai format
+ Liga: ML_PREDICT/ModelType/MLflow/Sklearn

Liga vs. Rikai (Pypi)

rikai

  • rikai 0.1.15

liga

  • liga 0.2.0dev3 0.2.0
    • liga-mlflow 0.2.0
    • liga-sklearn 0.2.0
    • liga-torch
    • liga-vision
      • liga-torchvision

Liga

Using the Pants build tool

Enable Python Linter and Checker

Before

# no linter and checker

After

bin/lint
bin/check

How to test

Before

sbt publishLocal

cd python
# create a python virtualenv
pip install -e . # pip install -e .[all] to install all optional extras (see "Install from pypi")

Now

sbt assembly

bin/test

Launch the Jupyter Lab

Before

sbt publishLocal

cd python
# create a python virtualenv
pip install -e . # pip install -e .[all] to install all optional extras (see "Install from pypi")
jupyterlab

Now

sbt assembly

bin/lab sklearn

Plans

The Liga project started as a hackathon project at Tubi. Here are the plans:

  1. Fork Rikai to Liga and remove the not-used code
  2. Integrate liga-sklearn to verify the usability of Liga
  3. #17

Publish to pypi

  • liga 0.2.0
  • liga-sklearn 0.2.0: the built-in and demo integration between Liga and Sklearn
    • liga
  • liga-mlflow 0.2.0: the integration between Liga and MLflow
    • liga

Extensible Registries

Here is the list of hard-coded registries:

  • MlflowRegistry
  • FileSystemRegistry
  • DummyRegistry
  • TFHubRegistry
  • TorchHubRegistry

Make it extensible, so that adding a new registry does not require coding in both Scala and Python. Python alone should be enough.
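
One possible shape for a Python-only extension point is a decorator that registers a loader per URI scheme. This is a sketch of the idea, not a proposal taken from the issue: `REGISTRIES`, `register`, `load` and the `myhub` scheme are all hypothetical names.

```python
from typing import Callable, Dict

Loader = Callable[[str], str]

# Maps a URI scheme to a loader; filled in by the decorator below.
REGISTRIES: Dict[str, Loader] = {}


def register(scheme: str) -> Callable[[Loader], Loader]:
    """Register the decorated function as the loader for `scheme`."""

    def decorator(loader: Loader) -> Loader:
        REGISTRIES[scheme] = loader
        return loader

    return decorator


@register("myhub")
def load_from_myhub(uri: str) -> str:
    # A real loader would return a model object; a string keeps the sketch runnable.
    return f"model loaded from {uri}"


def load(uri: str) -> str:
    # Dispatch on the text before the first colon, i.e. the URI scheme.
    scheme = uri.split(":", 1)[0]
    return REGISTRIES[scheme](uri)


print(load("myhub:///resnet50"))
```

With this pattern, a new registry is added by importing a Python module that contains one decorated function.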

Import unit tests from rikai

In #2, we removed the failing unit tests directly. Some of them are important unit tests:

liga/mlflow

liga/registry

liga/sklearn
