Comments (6)
Interesting idea. I'll take a look at this over the weekend. My initial thinking is that I'd want to make sure it plays nicely with PySpark (or figure out whether it does).
@evilsoapbox, thanks for the reply. I've actually put together a few shell scripts to start this endeavor. If you're interested, I'll submit a PR for review. I'd greatly appreciate your help with a few questions I have. :)
Hey @evilsoapbox, Happy New Year! 🎉
So, this PR is close to complete from a functionality perspective, but it needs a bit of review and feedback from someone who knows more about Dataproc than I do. :)
Regarding the need to make sure it plays nicely with PySpark, I followed a few posts on how others have managed to set up PySpark + conda:
- Cloudera: Prepare Your Apache Hadoop Cluster for PySpark Jobs
- StackOverflow: Installing Modules for SPARK on worker nodes
I wasn't able to get the PYSPARK_PYTHON environment variable properly set by exporting it in spark-env.sh as detailed in the Cloudera post above, so I instead set it by adding an export to root's .bashrc. From all I can tell, this is correctly set for all shells, except for remote job submission using the gcloud API from the command line (see below).
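For reference, here's roughly what the two approaches look like. The interpreter path matches the miniconda prefix in the output below; the spark-env.sh location is my assumption and may vary by image version.

# Approach from the Cloudera post (didn't take effect for me on Dataproc):
# append the export to Spark's environment script, e.g. /etc/spark/conf/spark-env.sh
export PYSPARK_PYTHON=/usr/local/bin/miniconda/bin/python

# What I did instead: append the same export to root's .bashrc
echo 'export PYSPARK_PYTHON=/usr/local/bin/miniconda/bin/python' >> /root/.bashrc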
For testing, besides getting the cluster to launch, the Spark shell to run, and a few examples to execute, we need to ensure that the worker nodes (executors) reference the correct (conda) Python distro. I do that by submitting a simple PySpark job that prints the distinct paths to the Python executable found across the partitions of an RDD.
import sys

import numpy as np
import pyspark

sc = pyspark.SparkContext()

# Distribute some dummy data across the cluster's executors.
data = np.random.randn(int(10e3))
dist_data = sc.parallelize(data)

# Collect the distinct Python interpreter paths seen on the executors.
python_execs = dist_data.map(lambda x: sys.executable).distinct().collect()
print(python_execs)
Calling from the root account on the master node seems to work fine:
> spark-submit get-sys-exec.py
...
['/usr/local/bin/miniconda/bin/python']
...
However, when submitting a job remotely using the Dataproc API, it references the default Python distribution:
> gcloud beta dataproc jobs submit pyspark --cluster $DATAPROC_CLUSTER_NAME get-sys-exec.py
...
['/usr/bin/python']
...
Any ideas as to why?
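One workaround I'm considering (untested, so treat it as a sketch): pass the interpreter to the executors explicitly at submit time via Spark's spark.executorEnv.* mechanism, so the job doesn't depend on .bashrc being sourced:

> gcloud beta dataproc jobs submit pyspark --cluster $DATAPROC_CLUSTER_NAME --properties spark.executorEnv.PYSPARK_PYTHON=/usr/local/bin/miniconda/bin/python get-sys-exec.py

That said, it would still be good to understand why the .bashrc export isn't picked up for remotely submitted jobs in the first place.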
Finally, sorry about the noisy commit history. Without an automated test service, I had to develop incrementally, testing by manually launching and deleting Dataproc clusters. When we get to the point where we're ready to merge, I can squash the commits into something more reasonable. :)
Thanks again, and I look forward to your reply. 😄
Hey @nehalecky
Sorry for the lag on this thread. We're getting back up to 100% after the US holidays. We will take a peek at this starting this week and provide some feedback. I just wanted to give you an update so you know why we've been so quiet and what we're going to do.
Cheers!
James
Hey James (@evilsoapbox), thanks for the update! Sounds good, and I look forward to the feedback. Let me know how I can help. :)
closed via: #18