Comments (10)
I am using the above and did pip install sparkxgb, and now it seems like I'm having some success, but deserializing is not working. I ran the code below and serialization seems OK, so I wonder why serialization would work but deserialization would not.
Error:
`
Py4JJavaError: An error occurred while calling o467.deserializeFromBundle.
: java.lang.NoClassDefFoundError: resource/package$
	at ml.dmlc.xgboost4j.scala.spark.mleap.XGBoostClassificationModelOp$$anon$1.load(XGBoostClassificationModelOp.scala:48)
	at ml.dmlc.xgboost4j.scala.spark.mleap.XGBoostClassificationModelOp$$anon$1.load(XGBoostClassificationModelOp.scala:20)
	at ml.combust.bundle.serializer.ModelSerializer.$anonfun$readWithModel$2(ModelSerializer.scala:106)
	at scala.util.Success.$anonfun$map$1(Try.scala:255)
	at scala.util.Success.map(Try.scala:213)
`
`
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
import pyspark
import pyspark.ml
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
import sparkxgb
from sparkxgb import XGBoostClassifier

sc = SparkContext()
sqlContext = HiveContext(sc)
spark = SparkSession.builder.config(
    "spark.jars",
    "/usr/lib/spark/jars/xgboost4j-1.7.3.jar,/usr/lib/spark/jars/mleap-xgboost-runtime_2.12-0.22.0.jar,/usr/lib/spark/jars/mleap-xgboost-spark_2.12-0.22.0.jar,/usr/lib/spark/jars/bundle-ml_2.12-0.22.0.jar"
).enableHiveSupport().getOrCreate()
spark.sparkContext.addPyFile("/usr/lib/spark/jars/pyspark-xgboost.zip")

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']
df = spark.createDataFrame([(7,20,3,6,1,10,3,53948,245351,1),(7,20,3,6,1,10,3,53948,245351,1)], features + ['label'])

xgboost = sparkxgb.XGBoostClassifier(
    featuresCol="features",
    labelCol="label")
pipeline = pyspark.ml.Pipeline(
    stages=[
        pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
        xgboost
    ]
)

model = pipeline.fit(df)
predictions = model.transform(df)

local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()
sc.stop()
`
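Since serialization succeeds but deserialization fails, one way to narrow things down is to look inside the bundle zip that got written. This is just a sketch (it assumes MLeap's usual bundle layout, where each serialized stage directory holds a model.json whose "op" field names the serializer that wrote it; `bundle_stage_ops` is a made-up helper, not part of mleap): if the xgboost stage's op is present in the zip, the failure is purely on the load side, i.e. a class missing from the deserializer's classpath.

```python
import json
import zipfile


def bundle_stage_ops(bundle_zip_path):
    """List the 'op' of every stage serialized into an MLeap bundle zip.

    Assumes the usual bundle layout: each stage directory contains a
    model.json whose 'op' field names the serializer that wrote it.
    """
    ops = []
    with zipfile.ZipFile(bundle_zip_path) as zf:
        for name in sorted(zf.namelist()):
            if name.endswith("model.json"):
                ops.append(json.loads(zf.read(name)).get("op"))
    return ops


# e.g. bundle_stage_ops("/tmp/pyspark.example.zip")
```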
Hmmm, I'm not too familiar with the sparkxgb package. Is there a reason you aren't using the built-in pyspark bindings for xgboost instead? They were just added in v1.7, so a lot of online guides probably still refer to other approaches.
I don't know if that would impact the error being generated, though. It seems to be complaining about an mleap class not being present in the JVM.
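A `NoClassDefFoundError` often means one of the jars never made it onto the classpath, and the missing `resource/package$` here looks like a transitive dependency of the mleap jars rather than a class in the jars listed explicitly. A quick sanity check worth running first (just a sketch; `missing_jars` is a hypothetical helper, not part of mleap or pyspark) is to verify that every path in the `spark.jars` value actually exists on disk before building the session:

```python
import os


def missing_jars(spark_jars_conf):
    """Return the entries of a comma-separated spark.jars value that
    don't exist on the local filesystem."""
    return [p for p in spark_jars_conf.split(",") if p and not os.path.exists(p)]


# e.g. missing_jars("/usr/lib/spark/jars/xgboost4j-1.7.3.jar,"
#                   "/usr/lib/spark/jars/bundle-ml_2.12-0.22.0.jar")
```

This won't catch a missing transitive dependency, but it rules out the simplest failure mode before digging into jar contents.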
Oh I see. Can this model be used inside of a bundle? (I think I got an error, but I could be wrong.) I.e., should I be able to use it as above and serialize/deserialize?
I mean, if I have a pipeline with such a model in it, I get an error like the one below:
AttributeError: 'SparkXGBClassifierModel' object has no attribute '_to_java'
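That `AttributeError` is consistent with MLeap's `serializeToBundle` reaching each stage's underlying JVM model through `_to_java`: `SparkXGBClassifierModel` from `xgboost.spark` is implemented in pure Python, so there is no JVM object to hand over. A small sketch (the helper name is made up) for spotting such stages before attempting serialization:

```python
def python_only_stages(pipeline_model):
    """Return class names of fitted stages that expose no _to_java,
    i.e. stages a JVM-side serializer like MLeap's cannot reach."""
    return [type(stage).__name__
            for stage in pipeline_model.stages
            if not hasattr(stage, "_to_java")]
```

If this returns anything, `serializeToBundle` on the whole pipeline is expected to fail with exactly this kind of error.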
Hmmm, @WeichenXu123, any thoughts on this? I know you added the pyspark bindings in xgboost.
To be clear this is the code I ran.
`
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
import pyspark
import pyspark.ml
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
from xgboost.spark import SparkXGBClassifier
sc = SparkContext()
sqlContext = HiveContext(sc)
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']
df = spark.createDataFrame([(7,20,3,6,1,10,3,53948,245351,1),(7,20,3,6,1,10,3,53948,245351,1)], features + ['label'])
xgboost = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=2
)
pipeline = pyspark.ml.Pipeline(
    stages=[
        pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
        xgboost
    ]
)
model = pipeline.fit(df)
predictions = model.transform(df)
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.stages[-1].set(deserialized_model.stages[-1].missing, 0.0)
deserialized_model.transform(df).show()
sc.stop()
`
Could you try mleap version 0.20? I remember a similar issue appearing on versions > 0.20.
Thanks so much for this. Give me a few days to try this and circle back. We actually got it working on 0.22 after changing some files (wrapper.py in the pyspark code), but I'd be interested to see if this works ...
Thanks for your investigation! Would you mind filing a PR containing the changes that made it work? That would help us fix it.
I don't have a PR, but I can tell you the specifics. Basically we needed to bundle a set of jars into one and run the commands below. This involves installing Java 11 to make 0.22 work, but we also needed to change some specific things inside Spark-related files. Specifically, see the change to wrapper.py needed to make xgboost 1.7.3 work, and the change to spark-env.sh to make the right Java be used. The jars I am using are a bundle of the mleap jars found on Maven. That sparkxgb zip is one I found online; it wraps xgboost so that it can be used in pyspark.
`
# We need this new Java since the new MLeap uses Java 11
sudo apt-get -y install openjdk-11-jdk
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# Change Java to 11 for Spark
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> $SPARK_HOME/conf/spark-env.sh

# If you don't do this, Spark will complain about Python 3.7 vs 3.8 - pin Python to a specific one
export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/conda/miniconda3/bin/python

# Pip install packages - mleap 0.22.0 needs Java 11
pip install mleap==0.22.0

# Change wrapper.py and turn off dynamic allocation; these are needed to make the
# imports work right, plus we get an error with dynamicAllocation=true
sudo sed -i.bak 's/spark.dynamicAllocation.enabled=true/spark.dynamicAllocation.enabled=false/' $SPARK_HOME/conf/spark-defaults.conf
sudo sed -i.bak '/.replace("org.apache.spark", "pyspark")/s/$/.replace("ml.dmlc.xgboost4j.scala.spark", "sparkxgb.xgboost")/' $SPARK_HOME/python/pyspark/ml/wrapper.py

# Transfer files where we need them. Don't forget to add the zip.
# There are 4 jars and 1 zip that need to be placed in the directory, see below
sudo gsutil cp gs://filesForDataproc2.0/jarsAndZip/* $SPARK_HOME/jars/
`
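For what it's worth, the `sed` on wrapper.py chains one extra `.replace` onto pyspark's JVM-to-Python class-name mapping, so that the xgboost4j Scala classes resolve to the `sparkxgb.xgboost` wrappers on deserialization. The effect is roughly this (a sketch of the resulting mapping, not the actual wrapper.py code):

```python
def jvm_class_to_python_class(java_class):
    """Mimic the class-name rewrite after the sed above: map Spark's JVM
    package to pyspark, and the xgboost4j Scala package to sparkxgb."""
    return (java_class
            .replace("org.apache.spark", "pyspark")
            .replace("ml.dmlc.xgboost4j.scala.spark", "sparkxgb.xgboost"))


# e.g. "ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel"
#      -> "sparkxgb.xgboost.XGBoostClassificationModel"
```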