combust / mleap
MLeap: Deploy ML Pipelines to Production
Home Page: https://combust.github.io/mleap-docs/
License: Apache License 2.0
We should start supporting backwards compatibility to at least Spark 2.0 by using bundle registry config files specific to each version.
I'm working through https://github.com/combust/mleap/wiki/Serializing-a-Spark-ML-Pipeline-and-Scoring-with-MLeap#serialize-the-ml-data-pipeline-and-rf-model-to-bundleml and I struggled to realize that one should pass featureModel instead of featurePipeline in
val pipeline = SparkUtil.createPipelineModel(uid = "pipeline", Array(featurePipeline, rf))
I couldn't find another way to contribute to the wiki (like a PR) - feel free to tell me how you would like to receive such contributions.
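For reference, a minimal sketch of the corrected step (an assumption based on the wiki's naming: featureModel is the PipelineModel produced by fitting featurePipeline on the training DataFrame df):

// Fit the feature pipeline first; pass the fitted model, not the unfitted
// pipeline, into the combined PipelineModel alongside the trained rf model.
val featureModel = featurePipeline.fit(df)
val pipeline = SparkUtil.createPipelineModel(uid = "pipeline", Array(featureModel, rf))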
How can I integrate custom Spark transformers and estimators into MLeap?
I am thinking of preprocessing steps, NaN cleaning, ...
Hi,
I just tried MLeap, which is really awesome. However, I was wondering: is there a way to get the schema of the exported model in a format such as PMML? That way we could have a better overview of the types of features the model uses.
Thanks
This should be implemented using Bundle.ML custom types.
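To make the shape of such a custom type concrete, here is a self-contained sketch of the pattern (plain Scala, not MLeap's actual traits; the NaNCleaner names are hypothetical): a pure core model holding the logic, a thin transformer wrapping it, and a separate serialization op that stores only the model's attributes.

// Hypothetical sketch of the Bundle.ML custom-type pattern; not the real API.
// Core model: pure logic, no Spark or MLeap runtime dependencies.
case class NaNCleanerModel(replacement: Double) {
  def apply(value: Double): Double =
    if (value.isNaN) replacement else value
}

// Thin transformer wrapping the core model for one runtime.
case class NaNCleaner(inputCol: String, outputCol: String, model: NaNCleanerModel)

// Serialization op: reads/writes only the model's attributes, so the same
// bundle can be loaded by the Spark and MLeap runtimes alike.
object NaNCleanerOp {
  def store(model: NaNCleanerModel): Map[String, Double] =
    Map("replacement" -> model.replacement)
  def load(attrs: Map[String, Double]): NaNCleanerModel =
    NaNCleanerModel(attrs("replacement"))
}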
Add support for Spark's MultiLayerPerceptron.
Coalesce transformer takes in multiple columns and chooses the first non-null value. Supports only doubles and nullable doubles.
StringMap takes in a string and outputs a double using a user-defined map.
Spark support for these two transformers will come with a later ticket.
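As an illustration of the intended semantics of these two transformers (a sketch of the logic only, not the actual MLeap transformer API):

// Coalesce: pick the first non-null value across the input columns for one
// row. Nullable doubles are modeled as Option[Double]; no match stays None.
def coalesce(values: Seq[Option[Double]]): Option[Double] =
  values.collectFirst { case Some(v) => v }

coalesce(Seq(None, Some(2.5), Some(7.0))) // Some(2.5)
coalesce(Seq(None, None))                 // None

// StringMap: map a string to a double using a user-defined map.
def stringMap(mapping: Map[String, Double], label: String): Double = mapping(label)

stringMap(Map("a" -> 1.0, "b" -> 2.0), "b") // 2.0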
Build out a Python module to serialize Scikit-learn + Pandas transformer pipelines to MLeap. We do not need to support deserialization to start.
Support LDA clustering algorithm
This epic is to track the progress towards full Spark support. It will not include transformers that require multiple data frames (recommendation algorithms), and it will not include LDA, which is a rather large undertaking to get all the code into place.
Include an optional schema.json file in the root bundle. Only include this if there is enough information to accurately describe the input and output schemas.
For Spark-trained pipelines, we will have to include the DataFrame used to train the pipeline while we serialize the model. SparkBundleContext already has an optional DataFrame for this purpose.
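For context, a sketch of how the Spark side can supply that DataFrame during serialization (withDataset is quoted verbatim from an issue later on this page; serializeToBundle follows the usage in the Binarizer example below; pipeline and df are assumed to be a fitted PipelineModel and its training DataFrame):

import java.io.File
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext

// The transformed DataFrame carries the schema information needed to
// write an accurate schema.json alongside the model.
implicit val sbc = SparkBundleContext().withDataset(pipeline.transform(df))
pipeline.serializeToBundle(new File("/tmp/pipeline-bundle"))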
Add a transformer to a new module called "mleap-tensorflow" that allows passing data into a TensorFlow graph for transformation.
This epic is for Spark transformers that are rather tricky for one reason or another to adapt to MLeap.
This is usually because multiple data frames may be involved in the transform process. MLeap will have to come up with a solution to this as we move forward.
Currently, working with MLeap from Java can be a pain. Let's make the interface nicer.
Allow users to store arbitrary metadata in the bundle file.
This can be useful for:
Spark has a nice feature that lets you build a Dataset from a case class. We should support this as well.
MleapReflection provides many tools that will be needed for this task.
I am thinking we should support the following conversions:
These conversions should be implicit and included in the MleapSupport trait for easy usage.
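A rough sketch of what such a conversion could look like (the names SimpleFrame and toFrame are hypothetical, not the real MleapSupport API; the real implementation would use MleapReflection to derive the schema from the case class):

// Hypothetical sketch: turn a Seq of case class instances into a tiny frame.
case class Passenger(age: Double, fare: Double)

case class SimpleFrame(columns: Seq[String], rows: Seq[Seq[Any]])

implicit class ProductSeqOps[T <: Product](data: Seq[T]) {
  def toFrame(columns: Seq[String]): SimpleFrame =
    SimpleFrame(columns, data.map(_.productIterator.toSeq))
}

val frame = Seq(Passenger(29.0, 71.28), Passenger(2.0, 151.55)).toFrame(Seq("age", "fare"))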
We should add an imputer to MLeap based on the Spark transformer.
Like other Scala-based projects, the release jar should have the Scala version in the jar name, or at least in the documentation.
I have started experimenting with serializing and deserializing Spark pipelines. However, I have noticed that when I override default params, they are missing after deserialization. I have narrowed down the cause to how the OpNode#load method is implemented, and specifically the use of .copy(model.extractParamMap()).
I am not very familiar with this copy API provided by Spark, so I cannot figure out whether this is a Spark bug or a misuse of the API. The only solution I've thought of so far is to explicitly get and set each param (as done in OpModel#load); see the sketch after the example below.
Here is a reproducible case, using Binarizer as an example:
import java.io.File
import org.apache.spark.ml.feature.Binarizer
import ml.combust.mleap.spark.SparkSupport._

// Binarizer with a non-default threshold (the default is 0.0).
val bin = new Binarizer("bin")
  .setInputCol("in")
  .setOutputCol("out")
  .setThreshold(0.5)

val path = new File(...)
bin.serializeToBundle(path)
val bin2 = path.deserializeBundle()._2.asInstanceOf[Binarizer]

assert(bin.getInputCol == bin2.getInputCol)
assert(bin.getOutputCol == bin2.getOutputCol)
assert(bin.getThreshold == bin2.getThreshold) // fails: threshold reverts to its default
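A sketch of the explicit get-and-set workaround mentioned above (mirroring what OpModel#load does), instead of relying on copy(extractParamMap()):

// Rebuild the transformer and copy each overridden param explicitly.
val restored = new Binarizer("bin")
  .setInputCol(bin.getInputCol)
  .setOutputCol(bin.getOutputCol)
  .setThreshold(bin.getThreshold)

assert(restored.getThreshold == bin.getThreshold) // passes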
Right now we only support protobuf serialization of decision trees. Let's offer JSON serialization as well for when JSON-only serialization is used.
When I deserialize a bundle model (a simple random forest model) from a zip file via a jar: URI, like
val bundle = BundleFile("jar:file:/home/userA/rf.zip").load().get
I get a bundle of type ml.combust.bundle.dsl.Bundle[Nothing], and then when I access root, I get the following exception:
java.lang.ClassCastException: ml.combust.mleap.runtime.transformer.Pipeline cannot be cast to scala.runtime.Nothing$
Thanks for the help in advance!
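For what it's worth, a sketch of a typed load that avoids the inferred Nothing (an assumption based on the MleapSupport helpers and scala-arm's managed, as used in the MLeap docs):

import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._

// loadMleapBundle() fixes the bundle's type parameter, so root comes back
// as an MLeap Transformer instead of Nothing.
val bundle = (for (bf <- managed(BundleFile("jar:file:/home/userA/rf.zip"))) yield {
  bf.loadMleapBundle().get
}).opt.get

val pipeline = bundle.root // ml.combust.mleap.runtime.transformer.Pipeline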
Spark added support for multinomial logistic regression; we should do the same.
This will be an extension to Spark, but should be built into MLeap.
Dear Sir or Madam,
When I try to save the model using:
val sbc = SparkBundleContext().withDataset(pipeline.transform(df))
I get the following exception:
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'ml'
I am using iheart/ficus for configuration settings, which is built on top of the typesafe.config library. So at the beginning of my program I have code like:
val conf = ConfigFactory.load()
val settings = new Settings(conf)
which reads configuration from reference.conf and application.conf.
When I test the code in spark-shell, without using the typesafe.config library, the code works.
How can I continue to use typesafe.config on my own without breaking MLeap?
Thanks.
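A likely cause (an assumption, not confirmed in this thread) is that MLeap's own reference.conf, which holds the 'ml' settings, gets dropped when the application is packaged as a fat jar. With sbt-assembly, concatenating reference.conf files across jars usually fixes this:

// build.sbt — a sketch, assuming the application is packaged with sbt-assembly.
// MLeap ships its 'ml' settings in a reference.conf inside its jars; if the
// fat-jar merge keeps only one reference.conf, those settings are lost.
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}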
Requires JDK 8
Requires Scala 2.11
So that people can evaluate things faster and see whether this works with their tech stack or not.
Use the Option monad for null values.
Data types should be optionally nullable.
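A small sketch of the idea (plain Scala; MLeap's actual data types are richer):

// A nullable double column becomes Seq[Option[Double]]: map applies
// uniformly and None propagates instead of nulls or sentinel values.
val col: Seq[Option[Double]] = Seq(Some(1.0), None, Some(3.5))
val scaled = col.map(_.map(_ * 2.0)) // Seq(Some(2.0), None, Some(7.0))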
Need to make sure that our MLeap transformers work in complete parity with Spark transformers. Use our Spark integration to do this.
Currently there is a large array of serialization formats for machine learning models. We propose a serialization format that is highly extensible, portable across languages and platforms, and open source, with a reference implementation in both Scala and Rust. We call this serialization format Bundle.ML. It supports custom transformers in Scala, Java, Python, C, Rust, or any other language.
MLeap seems to support RandomForestClassifier. What about xgboost, especially xgboost4j?
https://github.com/komiya-atsushi/xgboost-predictor-benchmark provides a great JVM-based and very fast evaluator. Maybe this would be helpful for an xgboost-MLeap integration.
We want to use NIO FileSystem objects to serialize Bundle.ML; this will make it much more versatile and simplify the code a great deal. We also want some small tweaks to how we serialize Bundle.ML root-level components.
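To illustrate the NIO idea (a sketch using only the standard library, not MLeap's actual serializer code): the same Path-based read/write code can then target a plain directory or a zip file transparently.

import java.net.URI
import java.nio.file.{FileSystems, Files}
import scala.collection.JavaConverters._

// Open a zip file as a FileSystem and write a root-level file into it.
val uri = URI.create("jar:file:/tmp/model.zip")
val zipFs = FileSystems.newFileSystem(uri, Map("create" -> "true").asJava)
try {
  Files.write(zipFs.getPath("/bundle.json"), """{"name":"pipeline"}""".getBytes("UTF-8"))
} finally {
  zipFs.close()
}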
Hi,
Just want to say this project is pretty cool - thanks for your effort! We are looking for a solution to train models offline using Spark, yet score online in real time. This is exactly what we need.
I was trying to follow the minimal doc in the wiki page: https://github.com/combust-ml/mleap/wiki/Setting-up-a-Spark-2.0-notebook-with-MLeap-an-Toree
The page seems unfinished. Is there a unit test class or example that I can follow to use MLeap?
Also, a couple of corrections.
Again, thanks!
Implement these transformers for Spark.
This will require creating a converter from MLeap UDFs to Spark UDFs; see the sketch below.
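A sketch of the direction (a plain Scala function stands in for an MLeap UDF; the real converter would also map MLeap types to Spark SQL types):

import org.apache.spark.sql.functions.udf

// An MLeap-style scalar function...
val binarize: Double => Double = v => if (v > 0.5) 1.0 else 0.0

// ...wrapped as a Spark UDF, ready to apply to a DataFrame column:
val binarizeUdf = udf(binarize)
// df.withColumn("out", binarizeUdf(df("in")))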