
Comments (10)

drei34 commented on August 12, 2024

I am using the above and did pip install sparkxgb, and now it seems like I'm having some success, but deserialization is not working. I ran the code below and serialization seems OK, so I wonder why serialization would work but deserialization would not.

Error:

```
Py4JJavaError: An error occurred while calling o467.deserializeFromBundle.
: java.lang.NoClassDefFoundError: resource/package$
	at ml.dmlc.xgboost4j.scala.spark.mleap.XGBoostClassificationModelOp$$anon$1.load(XGBoostClassificationModelOp.scala:48)
	at ml.dmlc.xgboost4j.scala.spark.mleap.XGBoostClassificationModelOp$$anon$1.load(XGBoostClassificationModelOp.scala:20)
	at ml.combust.bundle.serializer.ModelSerializer.$anonfun$readWithModel$2(ModelSerializer.scala:106)
	at scala.util.Success.$anonfun$map$1(Try.scala:255)
	at scala.util.Success.map(Try.scala:213)
```

```python
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
import pyspark
import pyspark.ml
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql import HiveContext

import sparkxgb
from sparkxgb import XGBoostClassifier

sc = SparkContext()
sqlContext = HiveContext(sc)

spark = SparkSession.builder.config(
    "spark.jars",
    "/usr/lib/spark/jars/xgboost4j-1.7.3.jar,/usr/lib/spark/jars/mleap-xgboost-runtime_2.12-0.22.0.jar,/usr/lib/spark/jars/mleap-xgboost-spark_2.12-0.22.0.jar,/usr/lib/spark/jars/bundle-ml_2.12-0.22.0.jar"
).enableHiveSupport().getOrCreate()

spark.sparkContext.addPyFile("/usr/lib/spark/jars/pyspark-xgboost.zip")

features = ['dayOfWeek', 'hour', 'channel', 'platform', 'deviceType', 'adUnit', 'pageType', 'zip', 'advertiserId']
df = spark.createDataFrame([(7, 20, 3, 6, 1, 10, 3, 53948, 245351, 1), (7, 20, 3, 6, 1, 10, 3, 53948, 245351, 1)], features + ['label'])

xgboost = sparkxgb.XGBoostClassifier(
    featuresCol="features",
    labelCol="label")

pipeline = pyspark.ml.Pipeline(
    stages=[
        pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
        xgboost
    ]
)

model = pipeline.fit(df)
predictions = model.transform(df)

local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()

sc.stop()
```
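Aside: a `java.lang.NoClassDefFoundError: resource/package$` at deserialization time usually means a transitive dependency (here the Scala `resource` package that MLeap's bundle serializer relies on) is missing from the jars on `spark.jars`, or a jar was built for a different Scala version. A small diagnostic helper (hypothetical, not part of mleap) can confirm which classes a jar actually ships:

```python
import zipfile

def jar_has_class(jar_path, class_name):
    """Return True if the compiled .class file for class_name is packaged in the jar."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# e.g., against the jars from the spark.jars config above:
# jar_has_class("/usr/lib/spark/jars/bundle-ml_2.12-0.22.0.jar", "resource.package$")
```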


jsleight commented on August 12, 2024

Hmmm, I'm not too familiar with the sparkxgb package. Is there a reason you aren't using the built-in PySpark bindings for XGBoost instead? They were only added in v1.7, so a lot of online guides probably still refer to other approaches.

I don't know if that would impact the error being generated, though. It seems to be complaining about an MLeap class not being present in the JVM.


drei34 commented on August 12, 2024

Oh, I see. Can this model be used inside a bundle? (I think I got an error, but I could be wrong.) That is, should I be able to use it as above and serialize/deserialize?


drei34 commented on August 12, 2024

I mean, if I have a pipeline with such a model in it, I get an error like the one below:

```
AttributeError: 'SparkXGBClassifierModel' object has no attribute '_to_java'
```
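For background, a sketch of the mechanism (not mleap's actual source): `serializeToBundle` converts each pipeline stage to its JVM counterpart via the stage's `_to_java()` method. JVM-backed stages such as `VectorAssembler` define it, while `xgboost.spark`'s `SparkXGBClassifierModel` is implemented in pure Python and does not, which is why serialization fails on that stage:

```python
# Toy stand-ins to illustrate the failure mode; these are not Spark classes.
class JvmBackedStage:
    """Plays the role of a JVM-backed stage like VectorAssembler."""
    def _to_java(self):
        return "<jvm handle>"

class PurePythonStage:
    """Plays the role of a pure-Python stage like SparkXGBClassifierModel."""

for stage in (JvmBackedStage(), PurePythonStage()):
    # A serializer that calls stage._to_java() hits AttributeError
    # on the second stage, just like in the error above.
    print(type(stage).__name__, hasattr(stage, "_to_java"))
# JvmBackedStage True
# PurePythonStage False
```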


jsleight commented on August 12, 2024

Hmmm, @WeichenXu123, any thoughts on this? I know you added the PySpark bindings in XGBoost.


drei34 commented on August 12, 2024

To be clear, this is the code I ran:

```python
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer
import pyspark
import pyspark.ml
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
from xgboost.spark import SparkXGBClassifier

sc = SparkContext()
sqlContext = HiveContext(sc)
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

features = ['dayOfWeek', 'hour', 'channel', 'platform', 'deviceType', 'adUnit', 'pageType', 'zip', 'advertiserId']
df = spark.createDataFrame([(7, 20, 3, 6, 1, 10, 3, 53948, 245351, 1), (7, 20, 3, 6, 1, 10, 3, 53948, 245351, 1)], features + ['label'])

xgboost = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=2
)

pipeline = pyspark.ml.Pipeline(
    stages=[
        pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
        xgboost
    ]
)

model = pipeline.fit(df)
predictions = model.transform(df)

local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)

deserialized_model.stages[-1].set(deserialized_model.stages[-1].missing, 0.0)
deserialized_model.transform(df).show()

sc.stop()
```


WeichenXu123 commented on August 12, 2024

Could you try mleap version 0.20? I remember a similar issue happening in versions > 0.20.


drei34 commented on August 12, 2024

Thanks so much for this. Give me a few days to try it and circle back. We actually got it working on 0.22 after changing some files (wrapper.py in the pyspark code). But I'd be interested to see if this works ...


WeichenXu123 commented on August 12, 2024

> Thanks so much for this. Give me a few days to try this and circle back. We actually got it working on 0.22 after changing some random files (wrapper.py in pyspark code). But I'd be interested to see if this works ...

Thanks for your investigation! Would you mind filing a PR containing the changes that make it work? That would help us fix it.


drei34 commented on August 12, 2024

I don't have a PR, but I can describe the setup. Basically we needed to bundle several jars together and run the commands below. This involves installing Java 11 (mleap 0.22 requires it), but we also needed to change some Spark-related files: see the edit to wrapper.py needed to make xgboost 1.7.3 work, and the change to spark-env.sh so the right Java gets used. The jars I am using are the MLeap jars found on Maven, bundled together. That sparkxgb zip is one I found online; it wraps xgboost4j so that it can be used from PySpark.

```shell
# We need this new Java since the new MLeap uses Java 11
sudo apt-get -y install openjdk-11-jdk
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# Change Java to 11 for Spark
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> $SPARK_HOME/conf/spark-env.sh

# If you don't do this, Spark will complain about Python 3.7 vs 3.8 - pin
# both driver and workers to a specific Python
export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/conda/miniconda3/bin/python

# Pip install packages - mleap 0.22.0 needs Java 11
pip install mleap==0.22.0

# Change wrapper.py and turn off dynamic allocation; the first edit is needed
# to make the imports resolve correctly, and dynamicAllocation=true caused an error
sudo sed -i.bak 's/spark.dynamicAllocation.enabled=true/spark.dynamicAllocation.enabled=false/' $SPARK_HOME/conf/spark-defaults.conf
sudo sed -i.bak '/.replace("org.apache.spark", "pyspark")/s/$/.replace("ml.dmlc.xgboost4j.scala.spark", "sparkxgb.xgboost")/' $SPARK_HOME/python/pyspark/ml/wrapper.py

# Transfer files where they're needed. Don't forget the zip: there are 4 jars
# and 1 zip that must be placed in the jars directory
sudo gsutil cp gs://filesForDataproc2.0/jarsAndZip/* $SPARK_HOME/jars/
```
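For context on the second `sed` line: PySpark's wrapper.py derives the Python class path for a deserialized stage from its JVM class name via string replacement, and the appended `.replace(...)` extends that mapping to the xgboost4j classes. A standalone sketch of the resulting mapping (the function name is illustrative, not PySpark's):

```python
def jvm_to_python_class(java_class):
    """Mimic the class-name mapping in pyspark/ml/wrapper.py after the sed edit above."""
    return (java_class
            .replace("org.apache.spark", "pyspark")
            .replace("ml.dmlc.xgboost4j.scala.spark", "sparkxgb.xgboost"))

print(jvm_to_python_class("org.apache.spark.ml.feature.VectorAssembler"))
# pyspark.ml.feature.VectorAssembler
print(jvm_to_python_class("ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel"))
# sparkxgb.xgboost.XGBoostClassificationModel
```

Without the extra `.replace`, deserialization tries to import a `ml.dmlc...` path as a Python module and fails; with it, the JVM model maps onto the sparkxgb wrapper class.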


sparkxgb.zip

