Yes! In fact, Sparkling Water provides H2O on top of a Spark cluster, so you can access all H2O services, including the R and Python interfaces. Look into the H2O-DEV project to see examples of R code: https://github.com/h2oai/h2o-dev/tree/master/h2o-r/tests
Deployment into an EC2 environment depends on the Spark infrastructure: you need a Spark cluster to be running, and then you just submit Sparkling Water to the cluster.
I will leave this issue open and try to provide more examples involving cooperation with R.
from sparkling-water.
Where exactly should I be looking in the h2o tests to see how Spark is being used under the hood for h2o algos in R...
Hi,
H2O algos are written from scratch in H2O (see the h2o-dev github).
We are not currently using Spark to implement H2O algos.
The examples and tests mostly use Spark for data selection and preprocessing (e.g. Spark SQL).
Then H2O algos are called.
You could also call MLlib algos if you wish.
Sparkling Water has both h2o-dev and Spark as Maven dependencies.
Thanks,
Tom
Hi,
I am interested in using h2o algorithms (like RF/GBM) to perform a classification task on a dataset loaded in Spark. Scala is on my "to-learn" list, but at this point in time I would like to use R. Is it possible to write R code that calls h2o algorithms on data in Spark? If yes, could you produce an example (like the ones you have provided using Scala)? That would be immensely helpful!
Thanks :-)
I seem to have the same doubt. The Sparkling Water FAQ says that it is possible, but that doesn't seem to be reflected in any of the examples. It would be great if you could clarify the above query.
Thanks in advance :)
@binga yes, you can connect from R to a running H2O/Sparkling Water cluster and run algos, analyze data, or do feature munging. See docs.h2o.ai for an R example.
You can also look at some examples in Sparkling Water incorporating R:
- Prepare data/models in Spark/Sparkling Water and use them from R: https://github.com/h2oai/sparkling-water/blob/master/examples/meetups/Meetup20150326.md
- Analysis on airline data, building a regression model in Sparkling Water, and producing a residuals plot in R; see the code at https://github.com/h2oai/sparkling-water/blob/master/examples/meetups/Meetup20150203.md
The main point is that if you have H2O/Sparkling Water running, you can combine different clients to drive the computation: prepare data in Sparkling Water + Spark, build a model from R, and analyze predicted results from the Flow UI.
@phanisrinath please look at the examples posted above, or come to our Sparkling Water meetups to see the interoperability.
Will there be any integration with SparkR?
I am not sure if it would be useful right now; our R approach is totally different from SparkR's. We are focused on being transparent for R users and distributing regular R operations in the backend.
However, we could probably expose H2OContext primitives inside SparkR.
Now that Spark is supported by EMR, as suggested here (https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark) and here (https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html), would it be possible to request a tutorial on how to get R, H2O/Sparkling Water, and Spark working with EMR with the given AMIs? Or are there other AMIs that you would recommend?
Hi,
right now we need to test Sparkling Water in the EMR environment.
But since Sparkling Water is just a jar which needs to be passed to spark-submit via the --jars option, I expect that the integration will be easy.
To use R with it, you just need to point R's h2o.init to the IP/port of the cluster.
michal
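To illustrate the connection step described above, a minimal sketch in R (the IP and port here are illustrative, not EMR-specific):

```r
# Connect a local R session to an H2O cloud that is already running
# inside the Spark cluster (replace IP/port with your cluster's address).
library(h2o)
h2o.init(ip = "10.0.0.1", port = 54321)
```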
Add me to the list. I have many R models based on R packages that I do not want to convert to h2o, and I have yet to find a simple walkthrough for taking current R models and porting them through h2o.
If it's not possible, that's cool, but posts seem to suggest it is and no one shows how it's done.
@bigfantasyfootball can you provide more details? What kind of models do you have in R?
Hi @mmalohlava, thanks for inquiring. If I could see how the ridiculously simple example below works using h2o, I could figure out the rest.
library(caret)
data(iris)
rf <- train(Species ~ ., data = iris, method = "rf")
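For later readers, a rough h2o counterpart of the caret call above might look like the following - a sketch assuming the h2o R package is installed and can start a local H2O instance (argument names follow the h2o 3 R API):

```r
library(h2o)
h2o.init()                        # start (or connect to) a local H2O instance
iris.hex <- as.h2o(iris)          # push the in-memory data frame into H2O
# Train a random forest: columns 1-4 are predictors, "Species" is the response.
rf <- h2o.randomForest(x = 1:4, y = "Species", training_frame = iris.hex)
```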
Hiya,
I'm also interested in integrating these technologies - can you simply point Sparkling Water at ./bin/sparkR instead of /bin/spark? Is there a compilation option to change the executable?
Alex
Hi Alex,
can you elaborate more on your idea? It sounds interesting, but I am not sure what you mean by changing /bin/spark to /bin/sparkR.
Just to clarify: Sparkling Water is built on top of Spark as an application, so it depends on Spark infrastructure and utilities (spark-submit, spark-shell).
Best regards,
Michal
Hi Michal,
Yes, I was confused! SparkR is an R package that imports the Spark library, methods, etc. into an R shell. As far as I understand it, Sparkling Water builds the H2O layer/interface on top of Spark - so if one adds the right dependencies for H2O to the SBT files in SparkR, will it then be able to link against both?
I'll need to dig a bit further to look into creating R packages....
Alex
Hi Michal,
as far as I understand it, in order to get Sparkling Water working within R, we'll need to write wrappers using the Spark<->R API for the functions we care about - or expose all the primitives for future developers. The other solution would be to write Scala functions incorporating Sparkling Water and then wrap them in the R API to expose a much simpler level of information transfer.
Which of these sounds easier to you? I'm tempted to go for the second one: write a Scala package that depends on Sparkling Water, using the machine learning I need from H2O, then write an R wrapper and try to include it so it gets built with SparkR. It's not a trivial problem...
Regards,
Alex
Hi Alex,
yes, you are right! And we are working on it (right now for Python and pySpark).
So let me explain how H2O's R support works: H2O exposes a REST API which surfaces the capabilities provided by the Java API. On top of the REST API we built several clients, including R, Python, and the Flow UI.
So if you import H2O's R package into your R session (library(h2o) - it is available via CRAN), we override a few functions, but to make it work you have to connect your R client to an existing H2O cluster via client <- h2o.init(ip=..., port=...).
From this point you can use all the capabilities provided by the h2o package (http://h2o-release.s3.amazonaws.com/h2o/rel-slater/5/docs-website/h2o-r/h2o_package.pdf).
The trick is that if you run H2O on top of Spark (as Sparkling Water), you have access to the same REST API. So you can connect to it from your local machine, and even from sparkR code (in theory - it is one of our goals for the next months).
Does it make sense?
Michal
Hi Michal,
Is it possible to show a quick example of loading the iris data set from R, converting it to an RDD, and uploading it into h2o in Sparkling Water as a frame? I know the h2o R package can do this (the as.h2o() function), but at the moment it writes to disk before uploading the file. I am hoping that with Sparkling Water there will be a process by which the data does not get written to disk first. (My actual data set is in RAM and is very large, and writing it to disk is an expensive operation.)
Furthermore, all of the examples I have seen seem to focus on uploading files that are already on disk, not on data that is already in memory/RAM. An example that shows data moving from RAM (in R) into h2o would be appreciated, assuming this is possible.
Regards,
Hiddi
Hi Hiddi,
so let's assume Spark with Sparkling Water is running - that means you created an H2OContext and started it (in the Spark shell or in a standalone application) with code like this:
import org.apache.spark.h2o._
val hc = new H2OContext(sc).start()
The H2O context gives you the address of the entry point to access from R:
h2oContext: org.apache.spark.h2o.H2OContext =
Sparkling Water Context:
* number of executors: 3
* list of used executors:
(executorId, host, port)
------------------------
(0,michals-mbp.0xdata.loc,54327)
(1,michals-mbp.0xdata.loc,54331)
(2,michals-mbp.0xdata.loc,54323)
------------------------
Open H2O Flow in browser: http://172.16.2.223:54321 (CMD + click in Mac OSX)
So you can now use R to connect to the cluster and load data:
library(h2o)
h = h2o.init(ip="172.16.2.223", port="54321")
iris.hex <- as.h2o(iris)
Now you should see the iris.hex dataset in the Flow UI (open http://172.16.2.223:54321).
However, as you mention, it will write a file to disk. BUT you can use h2o.uploadFile or h2o.importFile from HDFS if that is more handy. Right now we do not have support for uploading in-memory data, but technically I can imagine a solution - just stream the data to h2o, no big deal.
If you are interested, please file a JIRA for your use case (http://jira.h2o.ai).
btw: would Tachyon help you? We have Tachyon support disabled, but it should be easy to enable it again.
Hi Michal,
Support for uploading in-memory data would be amazing, and I am also very interested in the concept of streaming data to h2o! I'm not sure how to 'file a jira' as you suggest, but my use case would be this: I regularly receive several packets of smallish datasets; these are loaded into R's memory/environment so that lots of additional statistical features can be calculated, adding many columns to the original data. The data then becomes very large, and writing it to disk before uploading it into h2o's environment is a very expensive operation. I would therefore like an in-memory process for uploading that data from R's memory to h2o's environment (ideally at memory speed...).
How would data from R be streamed into h2o? Would it need something like Kafka, perhaps via the rkafka package? Or would the data need to be in Spark first?
Also, would perhaps the rscala package be useful for getting data into Scala to convert to a Spark dataframe to be sent to h2o?
Separately, just so that I understand how the process works: how does data transfer from Spark to h2o in a Sparkling Water setup? Is it written to disk as well?
I had not heard of Tachyon until you mentioned it, but from a quick Google search the concept sounds amazing. I am not sure how to use it for uploading data from R to h2o, though. Some simple examples of how to use Tachyon with h2o would be much appreciated.
Regards,
Hiddi
Hi Michal,
I have been following this conversation and I'm interested in the Tachyon support. We are about to deploy a Tachyon + Spark cluster for a new project, and I'd love to trial H2O for the machine-learning components of the project.
Is this currently possible? Using PySparkling Water, for example, could I do the following:
hc = H2OContext(sc).start()
hc.textFile("tachyon://path_to_file")
Or am I missing something?
With thanks,
Will
Hello Michal
We would like to support the following use case:
- Create/load a Spark data frame within R (using SparkR).
- Modify this data frame, then apply h2o machine learning methods to it. We would therefore like to upload this data frame to h2o from Spark, and we do not wish to use disk as an intermediary.
Do we have any example script demonstrating the above? Michal suggests "just stream data to h2o, not a big deal" - do we have any code snippet which can demo this idea?
Thanks very much
Hello Michal,
I am initiating Spark using SparkR on a Hadoop cluster (deploy-mode=yarn-client):
Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/cloudera-prod/conf.cloudera.yarn")
Sys.setenv(SPARK_HOME = "/home/softs/spark-1.6.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="yarn-client")
# pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
# for (pkg in pkgs) {
# if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
# }
#
# # Now we download, install and initialize the H2O package for R.
# install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turin/4/R")))
library(h2o)
I want to run H2O on that SparkContext without leaving the R environment. Can I run anything like the below in R/SparkR?
hc = H2OContext(sc).start()
Thanks,
Manu
Please refer to the https://github.com/h2oai/rsparkling repo, where rsparkling is currently located, for more information.