combust / mleap
MLeap: Deploy ML Pipelines to Production
Home Page: https://combust.github.io/mleap-docs/
License: Apache License 2.0
We should start supporting backwards compatibility to at least Spark 2.0 by using bundle registry config files specific to each version.
I'm working through https://github.com/combust/mleap/wiki/Serializing-a-Spark-ML-Pipeline-and-Scoring-with-MLeap#serialize-the-ml-data-pipeline-and-rf-model-to-bundleml and I struggled to realize that one should pass featureModel instead of featurePipeline in
val pipeline = SparkUtil.createPipelineModel(uid = "pipeline", Array(featurePipeline, rf))
I couldn't find another way to contribute to the wiki (like a PR) - feel free to tell me how you would like to receive such contributions.
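For reference, a minimal sketch of the corrected step (an assumption based on the wiki's naming: featureModel is the PipelineModel produced by fitting featurePipeline on the training DataFrame df):

// Fit the feature pipeline first; pass the fitted model, not the unfitted
// pipeline, into the combined PipelineModel alongside the trained rf model.
val featureModel = featurePipeline.fit(df)
val pipeline = SparkUtil.createPipelineModel(uid = "pipeline", Array(featureModel, rf))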
How can I integrate custom Spark transformers and estimators into MLeap?
I am thinking of preprocessing steps, NaN cleaning, ...
Hi,
I just tried MLeap, which is really awesome. However, I was wondering: is there a way to get the schema of the exported model in a format such as PMML? That way we could have a better overview of the types of features the model uses.
Thanks
This should be implemented using Bundle.ML custom types.
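To make the shape of such a custom type concrete, here is a self-contained sketch of the pattern (plain Scala, not MLeap's actual traits; the NaNCleaner names are hypothetical): a pure core model holding the logic, a thin transformer wrapping it, and a separate serialization op that stores only the model's attributes.

// Hypothetical sketch of the Bundle.ML custom-type pattern; not the real API.
// Core model: pure logic, no Spark or MLeap runtime dependencies.
case class NaNCleanerModel(replacement: Double) {
  def apply(value: Double): Double =
    if (value.isNaN) replacement else value
}

// Thin transformer wrapping the core model for one runtime.
case class NaNCleaner(inputCol: String, outputCol: String, model: NaNCleanerModel)

// Serialization op: reads/writes only the model's attributes, so the same
// bundle can be loaded by the Spark and MLeap runtimes alike.
object NaNCleanerOp {
  def store(model: NaNCleanerModel): Map[String, Double] =
    Map("replacement" -> model.replacement)
  def load(attrs: Map[String, Double]): NaNCleanerModel =
    NaNCleanerModel(attrs("replacement"))
}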
Add support for Spark's MultiLayerPerceptron.
Coalesce transformer takes in multiple columns and chooses the first non-null value. Supports only doubles and nullable doubles.
StringMap takes in a string and outputs a double using a user-defined map.
Spark support for these two transformers will come with a later ticket.
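As an illustration of the intended semantics of these two transformers (a sketch of the logic only, not the actual MLeap transformer API):

// Coalesce: pick the first non-null value across the input columns for one
// row. Nullable doubles are modeled as Option[Double]; no match stays None.
def coalesce(values: Seq[Option[Double]]): Option[Double] =
  values.collectFirst { case Some(v) => v }

coalesce(Seq(None, Some(2.5), Some(7.0))) // Some(2.5)
coalesce(Seq(None, None))                 // None

// StringMap: map a string to a double using a user-defined map.
def stringMap(mapping: Map[String, Double], label: String): Double = mapping(label)

stringMap(Map("a" -> 1.0, "b" -> 2.0), "b") // 2.0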
Build out a Python module to serialize Scikit-learn + Pandas transformer pipelines to MLeap. We do not need to support deserialization to start.
Support LDA clustering algorithm
This epic is to track the progress towards full Spark support. It will not include transformers that require multiple data frames (recommendation algorithms), and it will not include LDA, which is a rather large undertaking to get all the code into place.
Include an optional schema.json file in the root bundle. Only include this if there is enough information to accurately describe the input and output schemas.
For Spark-trained pipelines, we will have to include the DataFrame used to train the pipeline while we serialize the model. SparkBundleContext already has an optional DataFrame for this purpose.
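For context, a sketch of how the Spark side can supply that DataFrame during serialization (withDataset is quoted verbatim from an issue later on this page; serializeToBundle follows the usage in the Binarizer example below; pipeline and df are assumed to be a fitted PipelineModel and its training DataFrame):

import java.io.File
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext

// The transformed DataFrame carries the schema information needed to
// write an accurate schema.json alongside the model.
implicit val sbc = SparkBundleContext().withDataset(pipeline.transform(df))
pipeline.serializeToBundle(new File("/tmp/pipeline-bundle"))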
Add a transformer to a new module called "mleap-tensorflow" that allows passing data into a TensorFlow graph for transformation.
This epic is for Spark transformers that are rather tricky for one reason or another to adapt to MLeap.
This is usually because multiple data frames may be involved in the transform process. MLeap will have to come up with a solution to this as we move forward.
Currently, working with MLeap from Java can be a pain. Let's make the interface nicer.
Allow users to store arbitrary metadata in the bundle file.
This can be useful for:
Spark has a nice feature that lets you build a Dataset from a case class. We should support this as well.
MleapReflection provides many tools that will be needed for this task.
I am thinking we should support the following conversions:
These conversions should be implicit and included in the MleapSupport trait for easy usage.
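A rough sketch of what such a conversion could look like (the names SimpleFrame and toFrame are hypothetical, not the real MleapSupport API; the real implementation would use MleapReflection to derive the schema from the case class):

// Hypothetical sketch: turn a Seq of case class instances into a tiny frame.
case class Passenger(age: Double, fare: Double)

case class SimpleFrame(columns: Seq[String], rows: Seq[Seq[Any]])

implicit class ProductSeqOps[T <: Product](data: Seq[T]) {
  def toFrame(columns: Seq[String]): SimpleFrame =
    SimpleFrame(columns, data.map(_.productIterator.toSeq))
}

val frame = Seq(Passenger(29.0, 71.28), Passenger(2.0, 151.55)).toFrame(Seq("age", "fare"))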
We should add an imputer to MLeap based on the Spark transformer.
Like other Scala-based projects, the release jar should have the Scala version in the jar name, or at least in the documentation.
I have started experimenting with serializing and deserializing Spark pipelines. However, I have noticed that when I override default params, they are missing after deserialization. I have narrowed down the cause to how the OpNode#load method is implemented, and specifically the use of .copy(model.extractParamMap()).
I am not very familiar with this copy API provided by Spark, so I cannot figure out whether this is a Spark bug or a misuse of the API. The only solution I've thought of so far is to explicitly get and set each param (as done in OpModel#load); see the sketch after the example below.
Here is a reproducible case, using Binarizer as an example:
import java.io.File
import org.apache.spark.ml.feature.Binarizer
import ml.combust.mleap.spark.SparkSupport._

// Binarizer with a non-default threshold (the default is 0.0).
val bin = new Binarizer("bin")
  .setInputCol("in")
  .setOutputCol("out")
  .setThreshold(0.5)

val path = new File(...)
bin.serializeToBundle(path)
val bin2 = path.deserializeBundle()._2.asInstanceOf[Binarizer]

assert(bin.getInputCol == bin2.getInputCol)
assert(bin.getOutputCol == bin2.getOutputCol)
assert(bin.getThreshold == bin2.getThreshold) // fails: threshold reverts to its default
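A sketch of the explicit get-and-set workaround mentioned above (mirroring what OpModel#load does), instead of relying on copy(extractParamMap()):

// Rebuild the transformer and copy each overridden param explicitly.
val restored = new Binarizer("bin")
  .setInputCol(bin.getInputCol)
  .setOutputCol(bin.getOutputCol)
  .setThreshold(bin.getThreshold)

assert(restored.getThreshold == bin.getThreshold) // passes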
Right now we only support protobuf serialization of decision trees. Let's offer JSON serialization as well for when JSON-only serialization is used.
When I deserialize a bundle model (a simple random forest model) from a zip file via a jar: URI, like
val bundle = BundleFile("jar:file:/home/userA/rf.zip").load().get
I get a bundle of type ml.combust.bundle.dsl.Bundle[Nothing], and then when I access root, I get the following exception:
java.lang.ClassCastException: ml.combust.mleap.runtime.transformer.Pipeline cannot be cast to scala.runtime.Nothing$
Thanks for the help in advance!
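For what it's worth, a sketch of a typed load that avoids the inferred Nothing (an assumption based on the MleapSupport helpers and scala-arm's managed, as used in the MLeap docs):

import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._

// loadMleapBundle() fixes the bundle's type parameter, so root comes back
// as an MLeap Transformer instead of Nothing.
val bundle = (for (bf <- managed(BundleFile("jar:file:/home/userA/rf.zip"))) yield {
  bf.loadMleapBundle().get
}).opt.get

val pipeline = bundle.root // ml.combust.mleap.runtime.transformer.Pipeline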
Spark added support for multinomial logistic regression; we should do the same.
This will be an extension to Spark, but should be built into MLeap.
Dear Sir or Madam,
When I try to save the model using:
val sbc = SparkBundleContext().withDataset(pipeline.transform(df))
I get the following exception:
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'ml'
I am using iheart/ficus for configuration settings, which is built on top of the typesafe.config library. So at the beginning of my program I have code like:
val conf = ConfigFactory.load()
val settings = new Settings(conf)
which reads configuration from reference.conf and application.conf.
When I test the code in spark-shell, without using the typesafe.config library, the code works.
How can I continue to use typesafe.config on my own without breaking MLeap?
Thanks.
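A likely cause (an assumption, not confirmed in this thread) is that MLeap's own reference.conf, which holds the 'ml' settings, gets dropped when the application is packaged as a fat jar. With sbt-assembly, concatenating reference.conf files across jars usually fixes this:

// build.sbt — a sketch, assuming the application is packaged with sbt-assembly.
// MLeap ships its 'ml' settings in a reference.conf inside its jars; if the
// fat-jar merge keeps only one reference.conf, those settings are lost.
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}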
Requires JDK 8
Requires Scala 2.11
So that people can evaluate things faster and see whether this works with their tech stack or not.
Use the Option monad for null values.
Data types should be optionally nullable.
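A small sketch of the idea (plain Scala; MLeap's actual data types are richer):

// A nullable double column becomes Seq[Option[Double]]: map applies
// uniformly and None propagates instead of nulls or sentinel values.
val col: Seq[Option[Double]] = Seq(Some(1.0), None, Some(3.5))
val scaled = col.map(_.map(_ * 2.0)) // Seq(Some(2.0), None, Some(7.0))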
Need to make sure that our MLeap transformers work in complete parity with Spark transformers. Use our Spark integration to do this.
Currently there is a large array of serialization formats for machine learning models. We propose a serialization format that is highly extensible, portable across languages and platforms, and open source, with a reference implementation in both Scala and Rust. We call this serialization format Bundle.ML. It supports custom transformers in Scala, Java, Python, C, Rust, or any other language.
MLeap seems to support RandomForestClassifier. What about xgboost, especially xgboost4j?
https://github.com/komiya-atsushi/xgboost-predictor-benchmark provides a great JVM-based and very fast evaluator. Maybe this would be helpful for an xgboost-MLeap integration.
We want to use NIO FileSystem objects to serialize Bundle.ML; this will make it much more versatile and simplify the code a great deal. We also want some small tweaks to how we serialize Bundle.ML root-level components.
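To illustrate the NIO idea (a sketch using only the standard library, not MLeap's actual serializer code): the same Path-based read/write code can then target a plain directory or a zip file transparently.

import java.net.URI
import java.nio.file.{FileSystems, Files}
import scala.collection.JavaConverters._

// Open a zip file as a FileSystem and write a root-level file into it.
val uri = URI.create("jar:file:/tmp/model.zip")
val zipFs = FileSystems.newFileSystem(uri, Map("create" -> "true").asJava)
try {
  Files.write(zipFs.getPath("/bundle.json"), """{"name":"pipeline"}""".getBytes("UTF-8"))
} finally {
  zipFs.close()
}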
Hi,
Just want to say this project is pretty cool - thanks for your effort! We are looking for a solution to train models offline using Spark, yet score online in real time. This is exactly what we need.
I was trying to follow the minimal doc in the wiki page: https://github.com/combust-ml/mleap/wiki/Setting-up-a-Spark-2.0-notebook-with-MLeap-an-Toree
The page seems unfinished. Is there a unit test class or example that I can follow to use MLeap?
Also, a couple of corrections.
Again, thanks!
Implement these transformers for Spark.
This will require creating a converter from MLeap UDFs to Spark UDFs; see the sketch below.
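A sketch of the direction (a plain Scala function stands in for an MLeap UDF; the real converter would also map MLeap types to Spark SQL types):

import org.apache.spark.sql.functions.udf

// An MLeap-style scalar function...
val binarize: Double => Double = v => if (v > 0.5) 1.0 else 0.0

// ...wrapped as a Spark UDF, ready to apply to a DataFrame column:
val binarizeUdf = udf(binarize)
// df.withColumn("out", binarizeUdf(df("in")))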