Liga: Let Data Dance with ML Models
License: Apache License 2.0
The new Spark ML flavor requires ML_TRANSFORM but not ML_PREDICT.
The difference between ML_TRANSFORM and ML_PREDICT is the size of the model: ML_PREDICT is implemented using PySpark's pandas_udf and works well with small models that can be loaded on a single node, while ML_TRANSFORM is for big models that cannot be loaded on a single node.
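The pandas_udf implementation implies a batchwise contract: the model receives a pandas Series per batch and must return a Series of the same length. A minimal sketch of that contract in plain pandas (no Spark cluster; the wrapper and the stand-in "model" are illustrative, not Liga's actual code):

```python
import pandas as pd

def make_ml_predict(model_fn):
    """Wrap a small single-node model as a batchwise Series -> Series
    function: the same shape a pandas_udf body receives per batch."""
    def ml_predict(batch: pd.Series) -> pd.Series:
        # The model sees a whole batch at once, not one row at a time.
        return pd.Series(model_fn(batch.tolist()), index=batch.index)
    return ml_predict

# A trivial stand-in "model" that doubles its input.
predict = make_ml_predict(lambda xs: [2 * x for x in xs])
out = predict(pd.Series([1.0, 2.0, 3.0]))
```

In real usage the same function body would be decorated with `pandas_udf` so Spark applies it per partition batch, which is why the model must be small enough to load on each worker.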
eto-ai/rikai#338
Tried to implement ML_PREDICT for SparkML as @da-tubi did for eto-ai/rikai#326, but it's much more complex than I thought. Perhaps the best way to complete it is to implement the ML_PREDICT UDF for SparkML in Scala, so the worker will not need a SparkContext to get a properly set up JVM.
However, this is independent of this issue; we can still implement training via the SparkML flavor, we just can't use ML_PREDICT for SparkML.
eto-ai/rikai#343
Another attempt to implement ML_PREDICT for SparkML failed, even though I tried it in Scala. The key issue behind the failure is that a SparkML model can only operate on a Dataset, which is not reachable from inside a UDF; we would need to turn ML_PREDICT into a driver-side code generator rather than just another UDF.
https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labelsArray(0))
// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")
val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
println(s"Learned classification forest model:\n ${rfModel.toDebugString}")
Christopher Neugebauer:
Just revisiting this thread — here’s a demo of using ANTLR entirely without writing a plugin, that works against mainline Pants (2.16, I guess): https://github.com/pantsbuild/example-python/compare/main...chrisjrn:pantsbuild-example-python:chrisjrn/codegen_with_antlr?expand=1#diff-1[…]4c1ff2f
The liga-aws plugin needs this functionality.
Execution flows from Python -> JVM -> Python, which is too complicated; it would be better to display the process ID.
pytorch
There is no predict or transform method in SpectralClustering, and no model will be logged to MLflow. How can we apply this kind of model to the dataset?
Here is the code snippet to use SpectralClustering:
>>> from sklearn.cluster import SpectralClustering
>>> import numpy as np
>>> X = np.array([[1, 1], [2, 1], [1, 0],
... [4, 7], [3, 5], [3, 6]])
>>> clustering = SpectralClustering(n_clusters=2,
... assign_labels='discretize',
... random_state=0).fit(X)
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])
>>> clustering
SpectralClustering(assign_labels='discretize', n_clusters=2,
random_state=0)
Take sklearn's SpectralClustering for example: it is not a distributed ML model. For large-scale data that cannot be loaded into a single executor, SpectralClustering should not be applied.
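One way to work with such fit-only estimators is fit_predict, which produces labels at fit time on a dataset that fits on the driver; the labels can then be joined back to the original rows. A minimal sketch in plain sklearn (no Spark), using the same toy data as above:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# The same toy points as in the snippet above: two spatial groups.
X = np.array([[1, 1], [2, 1], [1, 0],
              [4, 7], [3, 5], [3, 6]])

# fit_predict fits the model and returns one cluster label per row,
# all in one pass; there is no separate predict() to call later.
labels = SpectralClustering(
    n_clusters=2, assign_labels="discretize", random_state=0
).fit_predict(X)
```

Because labels exist only for the rows seen at fit time, a distributed wrapper would have to collect (a sample of) the data to the driver, cluster it there, and join the labels back, rather than invoking the model row by row in a UDF.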
Liga is a general-purpose, ML-enhanced SQL framework designed to be modular, extensible, and scalable.
Try the latest notebooks:
git clone https://github.com/liga-ai/liga.git
bin/lab sklearn
Preview the notebooks on GitHub or try the live notebooks on Google Colab:
Model | Model Type | Official Documentation | Preview Latest Notebook | Try Google Colab Notebook |
---|---|---|---|---|
LinearRegression | regressor | Linear Models: Ordinary Least Squares | Demo | |
LogisticRegression | classifier | Logistic regression | Demo | |
Ridge | regressor | Ridge regression and classification | Demo | |
RidgeClassifier | classifier | Ridge regression and classification | Demo | |
SVC/NuSVC/LinearSVC | classifier | Support Vector Machines | Demo | |
SVR/NuSVR/LinearSVR | regressor | Support Vector Machines | Demo | |
RandomForestClassifier | classifier | Forests of randomized trees | Demo | |
RandomForestRegressor | regressor | Forests of randomized trees | Demo | |
ExtraTreesClassifier | classifier | Forests of randomized trees | Demo | |
ExtraTreesRegressor | regressor | Forests of randomized trees | Demo | |
KMeans | cluster | Clustering | Demo | |
PCA | transformer | Decomposing signals in components (matrix factorization problems) | Demo | |
ML_PREDICT for small models

SELECT
  id,
  ML_PREDICT(my_yolov5, image)
FROM cocodataset

ML_PREDICT is a special UDF which takes two parameters; model_name is a special parameter that looks like an identifier.

ML_TRANSFORM for big models

TODO (see #9)
A model instance is created by specifying the model flavor, type, and options on the URI.
-- Create model
CREATE [OR REPLACE] MODEL model_name
[FLAVOR flavor]
[MODEL_TYPE model_type]
[OPTIONS (key1=value1,key2=value2,...)]
USING "uri";
flavor: e.g. liga.sklearn => from liga.sklearn.codegen import codegen
model_type: e.g. classifier => from liga.sklearn.models.classifier import MODEL_TYPE
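Putting the syntax and the two parameters together, a hypothetical invocation might look like the following (the model name, option, and URI are made up for illustration):

```sql
CREATE MODEL my_classifier
FLAVOR liga.sklearn
MODEL_TYPE classifier
OPTIONS (key1=value1)
USING "mlflow:///my_classifier";
```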
-- Describe model
{ DESC | DESCRIBE } MODEL model_name;
-- Show all models
SHOW MODELS;
-- Delete a model
DROP MODEL model_name;
A Model Type encapsulates the interface and schema of a concrete ML model. It acts as an adapter between the raw ML model input/output tensors and Spark / Pandas.
Here is the key code snippet of the sklearn classifier
model type (liga.sklearn.models.classifier
):
from typing import Any, List

class Classifier(SklearnModelType):
    """Classification model type"""

    def schema(self) -> str:
        return "int"

    def predict(self, *args: Any, **kwargs: Any) -> List[int]:
        assert self.model is not None
        assert len(args) == 1
        return self.model.predict(args[0]).tolist()
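To see the model-type contract in action, here is a self-contained sketch with a hypothetical minimal stand-in for SklearnModelType (the real base class lives in Liga and does more): schema() names the SQL return type, and predict() adapts raw model output into plain Python values.

```python
from typing import Any, List

import numpy as np
from sklearn.linear_model import LogisticRegression

class SklearnModelType:
    """Hypothetical minimal stand-in for Liga's base class."""
    def __init__(self, model=None):
        self.model = model

class Classifier(SklearnModelType):
    def schema(self) -> str:
        # Declares the SQL type of the UDF's return value.
        return "int"

    def predict(self, *args: Any, **kwargs: Any) -> List[int]:
        assert self.model is not None
        assert len(args) == 1
        # Convert the numpy array into plain Python ints for Spark.
        return self.model.predict(args[0]).tolist()

# Fit a tiny sklearn model and wrap it in the model type.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
mt = Classifier(LogisticRegression().fit(X, y))
preds = mt.predict(X)
```

The key point is that the SQL layer never touches the numpy array: the model type converts tensors to values whose type matches `schema()`.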
A Flavor describes the framework upon which the model was built.
A Liga model flavor should provide:
Supported flavors:
- sklearn (provided by liga-sklearn)
- pytorch (provided by liga-pytorch)
A model registry specifies where and how to load a model.
Name | PyPI | URI | Description |
---|---|---|---|
DummyRegistry | liga | (none) | A special registry with no URI; how and where to load the model is hard-coded in model types, e.g. torchvision.models.resnet50(). |
FileSystemRegistry | liga | http:///, file:///, s3:///, ... | |
MLflowRegistry | liga-mlflow | mlflow:/// | The recommended production-ready model registry. |
Currently, only an in-memory model catalog is available in Liga. Via the model catalog, ML-enhanced SQL users need only focus on how to apply ML-enhanced SQL to datasets at scale; models are carefully maintained by data/ML engineers or data scientists.
WARNING: a Python API to customize the model catalog is not yet provided!
Liga is the ML-enhanced SQL part of Rikai. Rikai was created by @changhiskhan and @eddyxu, and the first release of Rikai dates back to 2021/04/04. @da-tubi and @Renkai created the Liga fork of Rikai as a project of the 4th Tubi Hackathon (#4).
Here is the result of the hackathon:
The pronunciation of Liga in Wu Chinese and Rikai in Japanese is almost the same. The meaning of Liga or Rikai is "understanding" in English, or 理解 in Chinese.
Rikai is too complicated to maintain and is dedicated to computer vision. Liga is designed to modularize Rikai; we only need the ML-enhanced SQL part of Rikai:
- Rikai: ML_PREDICT/ModelType/MLflow/PyTorch/Computer Vision/Rikai format
+ Liga: ML_PREDICT/ModelType/MLflow/Sklearn
0.1.15
0.2.0
0.2.0
0.2.0
liga-torch
liga-vision
liga-torchvision
Before
# no linter and checker
After
bin/lint
bin/check
Before
sbt publishLocal
cd python
# create a python virtualenv
pip install -e . # pip install -e .[all] to install all optional extras (see "Install from pypi")
Now
sbt assembly
bin/test
Before
sbt publishLocal
cd python
# create a python virtualenv
pip install -e . # pip install -e .[all] to install all optional extras (see "Install from pypi")
jupyterlab
Now
sbt assembly
bin/lab sklearn
The Liga project started as a hackathon project at Tubi. Here are the plans:
Here is the list of hard-coded registries:
Make it extensible, so that we do not need to write both Scala and Python code to add a new registry; just Python is needed.
We have to discuss if we should maintain liga-vision in this repo.
NumPy Input Support in PySpark
net.xmacs.liga
#91
sbt clean publishSigned sonatypeBundleRelease
$HOME/.sbt/1.0/sonatype.sbt
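The release command above relies on Sonatype credentials configured in that file. A typical $HOME/.sbt/1.0/sonatype.sbt might look like the following (the username and password are placeholders, and the host may differ, e.g. s01.oss.sonatype.org):

```scala
// Illustrative sbt credentials for publishing to Sonatype.
credentials += Credentials(
  "Sonatype Nexus Repository Manager",
  "oss.sonatype.org",
  "<sonatype-username>",
  "<sonatype-password>"
)
```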
In #2, we removed failed unit tests directly. Some of them are important unit tests:
liga/mlflow
python/liga/mlflow
python/test/liga/mlflow
liga/registry
python/liga/registry
mypy and pylint for liga/registry
liga/sklearn
Model Flavor is derived from MLflow.
Neptune: https://docs.neptune.ai/integrations/sklearn/#scikit-learn-logging-example
Comet: https://www.comet.com/docs/v2/integrations/ml-frameworks/scikit-learn/