
Comments (21)

jakubhava commented on September 18, 2024

Hi Dom-nik,
it's caused by Spark launching a new executor after the initialisation of H2OContext.

In Sparkling Water we try to discover all Spark executors at the start of H2OContext and start H2O on them. But if Spark for some reason launches a new executor, it does not have an H2O instance running, which then leads to an error during computation.

So what we do in this case is throw an exception on Spark topology changes and kill the cloud.
You can turn this listener off by setting spark.ext.h2o.topology.change.listener.enabled to false, but it still won't prevent the problem I described earlier (it's also explained in #4).

We are working on a new sparkling-water architecture which should solve these issues.

from sparkling-water.

Dom-nik commented on September 18, 2024

Hi MadMan0708,

Thanks for prompt reply!
How can I set this parameter? Can it be treated as a valid workaround?

From what you are saying I get the impression that running H2O on Hadoop is a better idea than treating Sparkling Water as a H2O backend. Am I right?


jakubhava commented on September 18, 2024

Hi Dom-nik,
you can set the property like this:

spark-submit \
--class water.SparklingWaterDriver \
--conf "spark.ext.h2o.topology.change.listener.enabled=false" \
--master yarn-client \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
/opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar

Regarding H2O & Sparkling Water.

It depends on your needs. If you don't use Spark to do, for example, some feature engineering or data munging, then there is probably no reason for you to use Sparkling Water.

However, if you already use Spark in your existing application, then I recommend Sparkling Water. In most cases it starts fine (we have problems on clusters with 60 or more nodes and we are working on a new solution to this problem as fast as we can).

Also, there is a tuning guide which should help you set up Spark so that it works better with Sparkling Water: https://github.com/h2oai/sparkling-water/blob/master/DEVEL.md#SparklingWaterTuning

Can you please try to start the H2OContext using your spark-submit one or two more times? Does this happen every time?
I see that you use 3 executors, so it should work perfectly (sorry, I forgot to write this in the first reply - the problem explained there generally occurs with a higher number of executors). It could also be caused by another problem - can you share your YARN and H2O logs?


Dom-nik commented on September 18, 2024

Thank you for this answer, but it seems that the command you provided is exactly the same as the one I posted at the beginning :]

I'm still willing to test out your solution, but I thought I'd also give you some context: what I actually wanted to achieve was to have a single H2O instance that would serve as a backend for Python- and R-based H2O calls, something like a server for many users. I'm not sure if that's the way H2O was meant to be used. Is it?

I was also considering using JupyterHub as the main GUI for end users and giving them access to H2O via Python and R instead of Flow, since it seems there is no multi-user operation built into Flow.
Is there? Can you have any authentication on the Flow side?


jakubhava commented on September 18, 2024

Hi,
the command is different, please have a look at it one more time :)

H2O is perfect for what you want to achieve. You can start an H2O cloud of arbitrary size and then access it using our R/Python/Java/REST APIs. You can make one call via the R API and another via the Python API.

I'm not the main Flow developer; let me ask our team about the Flow question.


Dom-nik commented on September 18, 2024

Hi,
I've tried the updated command and it still gives me
Exception in thread "main" java.lang.RuntimeException: Cloud size under 3 after some time, and I'm not able to view the GUI on port 54321. Do you have any other suggestions? :]


jakubhava commented on September 18, 2024

Hi, thanks for trying!
In order to further debug your problem, it would be great to see your YARN and H2O logs. Can you share them here? I'll have a look at them and then we can decide where to go next.


Dom-nik commented on September 18, 2024

Ok, there you go. These are the logs for one run; they present the most common type of error I'm getting:
sparkling-water.yarn.log.zip
sparkling-water.zip

Here is also a log from a different run that gave a different error; it occurred only once:
sparkling-water.log.2.zip


Dom-nik commented on September 18, 2024

Hello MadMan0708, did you have a moment to take a look at the logs?


jakubhava commented on September 18, 2024

Hi Dom-nik,
really sorry for the delay. I've been trying to finish a few changes over the last few weeks, which takes most of my time.

I'll check the logs today and let you know

Thanks for your patience,
Kuba


Dom-nik commented on September 18, 2024

Thanks! Looking forward to any news!


jakubhava commented on September 18, 2024

Hi Dom-nik,

so after looking at the logs, this is what I get:

The H2O cluster of size 3 is successfully created (judging from the H2O executor logs in the YARN log), but it seems like the H2O client in the driver is not able to communicate with the rest of the cluster.

There are 2 things you can do:

  1. Check that your firewall allows H2O communication. It can be the case that your firewall rules are very strict and allow only Spark communication.
  2. Use H2O's `-network` option:

-network <network1>[,<network2>, ...]: Specify a range (where applicable) of IP addresses (where <network1> represents the first network, <network2> the second, and so on). The IP address discovery code binds to the first interface that matches one of the networks in the comma-separated list. For example, 10.1.2.0/24 supports 256 possibilities.

Sparkling Water provides the configuration property spark.ext.h2o.network.mask where you can set the desired value. This value is then passed as -network when starting H2O nodes inside Spark.

You can set this property, for example, as a Spark configuration property when starting sparkling-shell in the normal way: ./bin/sparkling-shell --conf spark.ext.h2o.network.mask=10.1.2.0/24
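To make the matching rule concrete, here is a small illustrative sketch (not part of H2O or Sparkling Water) using Python's standard ipaddress module: an address matches if it falls inside any network in the comma-separated list, which is essentially the test the discovery code applies per interface.

```python
import ipaddress

def matches_network(ip, networks):
    """Return True if `ip` falls inside any of the given CIDR networks,
    mimicking how H2O's -network option decides which interface to bind."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net) for net in networks)

# 10.1.2.0/24 from the example above covers 10.1.2.0 - 10.1.2.255:
print(matches_network("10.1.2.47", ["10.1.2.0/24"]))    # True
print(matches_network("192.168.7.5", ["10.1.2.0/24"]))  # False
```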

Let me please know if that helps!

Kuba


kawaa commented on September 18, 2024

@madman0708 I ran into the same issue and tried all the tips provided in this conversation, but without success. I tried it on two different CDH clusters.

I get this error even when I use a single-node Cloudera Quickstart VM (CDH 5.5.0, Spark 1.5.0, Sparkling Water 1.5.14). Could you @madman0708 confirm that Sparkling Water 1.5.14 works fine with CDH 5.5.X or Spark 1.5.X? Alternatively, can you provide the versions that should integrate smoothly?


Dom-nik commented on September 18, 2024

We have some valuable debugging results. It seems that H2O doesn't support multihoming, which is not unusual, as multihoming is not supported by Hadoop in general.

Context: we have our Cloudera Hadoop cluster deployed on specialized hardware called Big Data Appliance (BDA), an Oracle product. Multihoming is used in the Big Data Appliance: cluster nodes communicate with each other via InfiniBand on their internal network, using INTERNAL IP addresses, and they communicate with the rest of the P&G intranet using EXTERNAL IP addresses.

CDH (and Hadoop in general) doesn't support multihoming (cluster nodes belonging to multiple networks). Multihoming is supported for some appliances (BDA being one of them), but our edge nodes are not within the BDA, which is a non-standard setup. So when you add non-BDA nodes, you are outside the supported/recommended configuration from both the Oracle side and the Cloudera side. It is not a sub-optimal setup; it is just that Hadoop and related technologies (unfortunately) have not really been designed with multi-homed networking in mind.

This causes connectivity issues, as (according to a Cloudera expert):

Historically we have had issues running pyspark from non-BDA nodes because of similar issues. We have also had issues running spark shell that we have worked around by specifying IP addresses instead of hostnames.

This hypothesis was confirmed by running Sparkling Water directly on one of the cluster nodes:
We tried to run Sparkling Water on a BDA node and it seems to work fine. We used sparkling-water-1.5.6 and the steps from http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.5/6/index.html (the RUN ON HADOOP tab). An example command like:
$ spark-submit --class water.SparklingWaterDriver --master yarn-client --num-executors 8 --driver-memory 8g --executor-memory 4g --executor-cores 1 assembly/build/libs/*.jar
worked fine.

Do you have any comments to add? Do you plan to dig deeper into a case like this or is it totally outside your scope?


mmalohlava commented on September 18, 2024

Hi Dominik,

is it possible to share privately logs from Spark run?

My point is that if Spark is communicating (can see executors and send/receive messages), then in H2O we should follow the same communication paths. If not, we need to help H2O to share the same IP/port.
My theory is that the driver H2O (living in the same JVM as Spark driver)

You can try to specify spark.ext.h2o.network.mask to force the H2O driver (living in the Spark driver) to select the right IP on the right interface...


Dom-nik commented on September 18, 2024

Hi Michal,

Thanks for your reply. It seems that it got cut in the middle :]

You can find a new batch of YARN logs here: sparkling.yarn.logs.27062016.tar.gz
(Sparkling Water failed with the following error after I tried to connect to Flow on port 54321):
Exception in thread "main" java.lang.RuntimeException: Cloud size under 3

I tried running the application with spark.ext.h2o.network.mask:

spark-submit \
--class water.SparklingWaterDriver \
--master yarn-client \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 /opt/sparkling-water/sparkling-water-1.5.14/assembly/build/libs/*.jar \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" "spark.ext.h2o.topology.change.listener.enabled=false" "spark.ext.h2o.network.mask=192.168.7.0/255"

but it behaved exactly the same.
The YARN logs from this run are here:
sparkling.yarn.logs.27062016.2.tar.gz

I'm not 100% sure if the mask was specified correctly.
EDIT: I know it was not, but I've tried with 192.168.7.0/24 too and it failed.
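As a side note, a malformed mask such as 192.168.7.0/255 can be caught before launch. A purely illustrative sketch (not part of Sparkling Water) using Python's standard ipaddress module, which rejects invalid CIDR strings:

```python
import ipaddress

def valid_mask(mask):
    """Check a CIDR string before passing it to spark.ext.h2o.network.mask."""
    try:
        ipaddress.ip_network(mask)
        return True
    except ValueError:
        return False

print(valid_mask("192.168.7.0/24"))   # True
print(valid_mask("192.168.7.0/255"))  # False: /255 is not a valid prefix length
```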


Dom-nik commented on September 18, 2024

Just to close the case with some relevant info: there was some debugging that we did together with H2O, and a custom patch was developed (released with Sparkling Water 1.5.16). It enables a new parameter, spark.ext.h2o.node.network.mask, to specify a mask for internal IPs.

Here's a way to run the tool so that it works:

spark-submit \
--class water.SparklingWaterDriver \
--master yarn-client \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--conf "spark.scheduler.minRegisteredResourcesRatio=1" \
--conf "spark.ext.h2o.topology.change.listener.enabled=false" \
--conf "spark.ext.h2o.node.network.mask=<IP_NUMBER>/<MASK>" \
--conf "spark.ext.h2o.fail.on.unsupported.spark.param=false" \
/opt/sparkling-water/sparkling-water-1.5.16/assembly/build/libs/*.jar

e.g. "spark.ext.h2o.node.network.mask=10.0.0.0/24"


jakubhava commented on September 18, 2024

Hi @Dom-nik,

thank you again for writing up the outcome!


ibobak commented on September 18, 2024

Despite applying all these settings, I am receiving the same error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.h2o.JavaH2OContext.getOrCreate.
: java.lang.RuntimeException: Cloud size under 11
	at water.H2O.waitForCloudSize(H2O.java:1689)
	at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:117)
	at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:121)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:355)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:371)
	at org.apache.spark.h2o.H2OContext.getOrCreate(H2OContext.scala)
	at org.apache.spark.h2o.JavaH2OContext.getOrCreate(JavaH2OContext.java:228)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

Here are my configs in the Notebook's kernel:

"PYSPARK_SUBMIT_ARGS":" --py-files /usr/local/share/jupyter/kernels/sparkling-water-2.2.2/py/build/dist/h2o_pysparkling_2.2-2.2.2.zip 
  --conf \"spark.scheduler.minRegisteredResourcesRatio=1\" 
  --conf \"spark.ext.h2o.topology.change.listener.enabled=false\" 
  --conf \"spark.ext.h2o.fail.on.unsupported.spark.param=false\" 
  --conf \"spark.ext.h2o.node.network.mask=10.5.33.0/24\" 
  --jars /usr/local/share/jupyter/kernels/aws-lib/hadoop-aws-2.7.3.jar,/usr/local/share/jupyter/kernels/aws-lib/aws-java-sdk-1.7.4.jar   
  --driver-memory 8G 
  --executor-memory 24G   
  --conf \"spark.dynamicAllocation.enabled=false\" 
  --num-executors 10 
  --executor-cores 2 
  --master spark://10.5.33.36:7077 pyspark-shell"    

I am using Spark 2.2.0 with Sparkling Water 2.2.2.

In the Spark app I clearly see that it started one driver and 10 executors, and (as you can see) the number of executors is explicitly configured. Despite that, this annoying error simply doesn't allow H2O to run.

I'll be very grateful for any ideas on how to run it.


idoshichor commented on September 18, 2024

Hello @jakubhava,

Does Sparkling Water already support spark.dynamicAllocation.enabled=true?

We want to use it on Spark, but scaling the cluster up and down is very important for us.

Thanks.


jakubhava commented on September 18, 2024

Hi @idoshichor,
There are two backends in Sparkling Water - internal and external. In the external backend, you can use the spark.dynamicAllocation.enabled=true option and Spark can kill or add new executors without affecting H2O.

In the internal backend, this option is not allowed, and we think it won't be available there for several technical reasons. If you need dynamic allocation, I would advise looking at the external backend.
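For reference, a minimal sketch of how the external backend might be selected (assuming Sparkling Water 2.x, where the spark.ext.h2o.backend.cluster.mode property chooses the backend; check the documentation for your version, since the external backend also needs a separate H2O cluster to connect to):

```
./bin/pysparkling \
  --conf "spark.ext.h2o.backend.cluster.mode=external" \
  --conf "spark.dynamicAllocation.enabled=true"
```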

