Liga: Let Data Dance with ML Models
License: Apache License 2.0
The new Spark ML flavor requires ML_TRANSFORM but not ML_PREDICT.
The difference between ML_TRANSFORM and ML_PREDICT is the size of the model: ML_PREDICT is implemented using PySpark's pandas_udf and works well with small models that can be loaded on a single node, while ML_TRANSFORM is for big models that cannot be loaded on a single node.
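The pandas_udf implementation implies a batchwise contract: the model receives a pandas Series per batch and must return a Series of the same length. A minimal sketch of that contract in plain pandas (no Spark cluster; the wrapper and the stand-in "model" are illustrative, not Liga's actual code):

```python
import pandas as pd

def make_ml_predict(model_fn):
    """Wrap a small single-node model as a batchwise Series -> Series
    function: the same shape a pandas_udf body receives per batch."""
    def ml_predict(batch: pd.Series) -> pd.Series:
        # The model sees a whole batch at once, not one row at a time.
        return pd.Series(model_fn(batch.tolist()), index=batch.index)
    return ml_predict

# A trivial stand-in "model" that doubles its input.
predict = make_ml_predict(lambda xs: [2 * x for x in xs])
out = predict(pd.Series([1.0, 2.0, 3.0]))
```

In real usage the same function body would be decorated with `pandas_udf` so Spark applies it per partition batch, which is why the model must be small enough to load on each worker.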
eto-ai/rikai#338
Tried to implement ML_PREDICT for SparkML as @da-tubi did for eto-ai/rikai#326, but it's much more complex than I thought. Perhaps the best way to complete it is to implement the ML_PREDICT UDF for SparkML in Scala, so the worker will not need a SparkContext to get a properly set up JVM.
However, this is independent of this issue; we can still implement training via the SparkML flavor, we just can't use ML_PREDICT for SparkML.
eto-ai/rikai#343
Another attempt to implement ML_PREDICT for SparkML failed, even though I tried it in Scala. The key issue behind the failure is that a SparkML model can only operate on a Dataset, which is not reachable from inside a UDF; we would need to turn ML_PREDICT into a driver-side code generator rather than just another UDF.
https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labelsArray(0))
// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")
val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
println(s"Learned classification forest model:\n ${rfModel.toDebugString}")
Christopher Neugebauer:
Just revisiting this thread — here’s a demo of using ANTLR entirely without writing a plugin, that works against mainline Pants (2.16, I guess): https://github.com/pantsbuild/example-python/compare/main...chrisjrn:pantsbuild-example-python:chrisjrn/codegen_with_antlr?expand=1#diff-1[…]4c1ff2f
The liga-aws plugin needs this functionality.
Execution flows from Python -> JVM -> Python, which is too complicated; it would be better to display the process ID.
pytorch
There is no predict or transform method in SpectralClustering, and no model will be logged to MLflow. How can we apply this kind of model to the dataset?
Here is the code snippet to use SpectralClustering:
>>> from sklearn.cluster import SpectralClustering
>>> import numpy as np
>>> X = np.array([[1, 1], [2, 1], [1, 0],
... [4, 7], [3, 5], [3, 6]])
>>> clustering = SpectralClustering(n_clusters=2,
... assign_labels='discretize',
... random_state=0).fit(X)
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])
>>> clustering
SpectralClustering(assign_labels='discretize', n_clusters=2,
random_state=0)
Take sklearn's SpectralClustering for example: it is not a distributed ML model. For large-scale data that cannot be loaded into a single executor, SpectralClustering should not be applied.
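One way to work with such fit-only estimators is fit_predict, which produces labels at fit time on a dataset that fits on the driver; the labels can then be joined back to the original rows. A minimal sketch in plain sklearn (no Spark), using the same toy data as above:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# The same toy points as in the snippet above: two spatial groups.
X = np.array([[1, 1], [2, 1], [1, 0],
              [4, 7], [3, 5], [3, 6]])

# fit_predict fits the model and returns one cluster label per row,
# all in one pass; there is no separate predict() to call later.
labels = SpectralClustering(
    n_clusters=2, assign_labels="discretize", random_state=0
).fit_predict(X)
```

Because labels exist only for the rows seen at fit time, a distributed wrapper would have to collect (a sample of) the data to the driver, cluster it there, and join the labels back, rather than invoking the model row by row in a UDF.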
Liga is a general-purpose, ML-enhanced SQL framework designed to be modular, extensible, and scalable.
Try the latest notebooks:
git clone https://github.com/liga-ai/liga.git
bin/lab sklearn
Preview the notebooks on GitHub or try the live notebooks on Google Colab:
Model | Model Type | Official Documentation | Preview Latest Notebook | Try Google Colab Notebook |
---|---|---|---|---|
LinearRegression | regressor | Linear Models: Ordinary Least Squares | Demo | |
LogisticRegression | classifier | Logistic regression | Demo | |
Ridge | regressor | Ridge regression and classification | Demo | |
RidgeClassifier | classifier | Ridge regression and classification | Demo | |
SVC/NuSVC/LinearSVC | classifier | Support Vector Machines | Demo | |
SVR/NuSVR/LinearSVR | regressor | Support Vector Machines | Demo | |
RandomForestClassifier | classifier | Forests of randomized trees | Demo | |
RandomForestRegressor | regressor | Forests of randomized trees | Demo | |
ExtraTreesClassifier | classifier | Forests of randomized trees | Demo | |
ExtraTreesRegressor | regressor | Forests of randomized trees | Demo | |
KMeans | cluster | Clustering | Demo | |
PCA | transformer | Decomposing signals in components (matrix factorization problems) | Demo | |
ML_PREDICT for small models

SELECT
  id,
  ML_PREDICT(my_yolov5, image)
FROM cocodataset

ML_PREDICT is a special UDF which takes two parameters; model_name is a special parameter that looks like an identifier.

ML_TRANSFORM for big models

TODO (see #9)
A model instance is created by specifying the model flavor, type, and options on the URI.
-- Create model
CREATE [OR REPLACE] MODEL model_name
[FLAVOR flavor]
[MODEL_TYPE model_type]
[OPTIONS (key1=value1,key2=value2,...)]
USING "uri";
flavor: e.g. liga.sklearn => from liga.sklearn.codegen import codegen
model_type: e.g. classifier => from liga.sklearn.models.classifier import MODEL_TYPE
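Putting the syntax and the two parameters together, a hypothetical invocation might look like the following (the model name, option, and URI are made up for illustration):

```sql
CREATE MODEL my_classifier
FLAVOR liga.sklearn
MODEL_TYPE classifier
OPTIONS (key1=value1)
USING "mlflow:///my_classifier";
```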
-- Describe model
{ DESC | DESCRIBE } MODEL model_name;
-- Show all models
SHOW MODELS;
-- Delete a model
DROP MODEL model_name;
A Model Type encapsulates the interface and schema of a concrete ML model. It acts as an adapter between the raw ML model input/output tensors and Spark / Pandas.
Here is the key code snippet of the sklearn classifier
model type (liga.sklearn.models.classifier
):
from typing import Any, List

class Classifier(SklearnModelType):
    """Classification model type"""

    def schema(self) -> str:
        return "int"

    def predict(self, *args: Any, **kwargs: Any) -> List[int]:
        assert self.model is not None
        assert len(args) == 1
        return self.model.predict(args[0]).tolist()
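To see the model-type contract in action, here is a self-contained sketch with a hypothetical minimal stand-in for SklearnModelType (the real base class lives in Liga and does more): schema() names the SQL return type, and predict() adapts raw model output into plain Python values.

```python
from typing import Any, List

import numpy as np
from sklearn.linear_model import LogisticRegression

class SklearnModelType:
    """Hypothetical minimal stand-in for Liga's base class."""
    def __init__(self, model=None):
        self.model = model

class Classifier(SklearnModelType):
    def schema(self) -> str:
        # Declares the SQL type of the UDF's return value.
        return "int"

    def predict(self, *args: Any, **kwargs: Any) -> List[int]:
        assert self.model is not None
        assert len(args) == 1
        # Convert the numpy array into plain Python ints for Spark.
        return self.model.predict(args[0]).tolist()

# Fit a tiny sklearn model and wrap it in the model type.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
mt = Classifier(LogisticRegression().fit(X, y))
preds = mt.predict(X)
```

The key point is that the SQL layer never touches the numpy array: the model type converts tensors to values whose type matches `schema()`.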
A Flavor describes the framework upon which the model was built.
A Liga model flavor should provide:
Supported flavors:
- sklearn (provided by liga-sklearn)
- pytorch (provided by liga-pytorch)
A model registry specifies where and how to load a model.
Name | PyPI | URI | Description |
---|---|---|---|
DummyRegistry | liga | (none) | A special registry with no URI; how and where to load the model is hard-coded in model types, e.g. torchvision.models.resnet50(). |
FileSystemRegistry | liga | http:///, file:///, s3:///, ... | |
MLflowRegistry | liga-mlflow | mlflow:/// | The recommended production-ready model registry. |
Currently, only an in-memory model catalog is available in Liga. Via the model catalog, ML-enhanced SQL users need only focus on how to apply ML-enhanced SQL to datasets at scale; models are carefully maintained by data/ML engineers or data scientists.
WARNING: a Python API to customize the model catalog is not yet provided!
Liga is the ML-enhanced SQL part of Rikai. Rikai was created by @changhiskhan and @eddyxu, and the first release of Rikai dates back to 2021/04/04. @da-tubi and @Renkai created the Liga fork of Rikai as a project of the 4th Tubi Hackathon (#4).
Here is the result of the hackathon:
The pronunciation of Liga in Wu Chinese and Rikai in Japanese is almost the same. The meaning of Liga or Rikai is "understanding" in English, or 理解 in Chinese.
Rikai is too complicated to maintain and is dedicated to computer vision. Liga is designed to modularize Rikai; we only need the ML-enhanced SQL part of Rikai:
- Rikai: ML_PREDICT/ModelType/MLflow/PyTorch/Computer Vision/Rikai format
+ Liga: ML_PREDICT/ModelType/MLflow/Sklearn
0.1.15
0.2.0
0.2.0
0.2.0
liga-torch
liga-vision
liga-torchvision
Before
# no linter and checker
After
bin/lint
bin/check
Before
sbt publishLocal
cd python
# create a python virtualenv
pip install -e . # pip install -e .[all] to install all optional extras (see "Install from pypi")
Now
sbt assembly
bin/test
Before
sbt publishLocal
cd python
# create a python virtualenv
pip install -e . # pip install -e .[all] to install all optional extras (see "Install from pypi")
jupyterlab
Now
sbt assembly
bin/lab sklearn
The Liga project started as a hackathon project at Tubi. Here are the plans:
Here is the list of hard-coded registries:
Make it extensible, so that we do not need to write both Scala and Python code to add a new registry; just Python is needed.
We have to discuss if we should maintain liga-vision in this repo.
NumPy Input Support in PySpark
net.xmacs.liga
#91
sbt clean publishSigned sonatypeBundleRelease
$HOME/.sbt/1.0/sonatype.sbt
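The release command above relies on Sonatype credentials configured in that file. A typical $HOME/.sbt/1.0/sonatype.sbt might look like the following (the username and password are placeholders, and the host may differ, e.g. s01.oss.sonatype.org):

```scala
// Illustrative sbt credentials for publishing to Sonatype.
credentials += Credentials(
  "Sonatype Nexus Repository Manager",
  "oss.sonatype.org",
  "<sonatype-username>",
  "<sonatype-password>"
)
```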
In #2, we removed failed unit tests directly. Some of them are important unit tests:
liga/mlflow
python/liga/mlflow
python/test/liga/mlflow
liga/registry
python/liga/registry
mypy and pylint for liga/registry
liga/sklearn
Model Flavor is derived from MLflow.
Neptune: https://docs.neptune.ai/integrations/sklearn/#scikit-learn-logging-example
Comet: https://www.comet.com/docs/v2/integrations/ml-frameworks/scikit-learn/