Comments (6)
Interesting idea. I'll take a look at this over the weekend. My initial thinking is that I'd want to make sure it plays nicely with PySpark (or figure out whether it does).
@evilsoapbox, thanks for the reply. I've actually put together a few shell scripts to start this endeavor. If you're interested, I'll submit a PR for review. I'd greatly appreciate your help with a few questions I have. :)
Hey @evilsoapbox, Happy New Year! 🎉
So, this PR is close to complete from a functionality perspective, but it needs a bit of review and feedback from someone who knows more about Dataproc than I do. :)
Regarding the need to make sure it plays nicely with PySpark, I followed a few posts on how others have managed to set up PySpark + conda:
- Cloudera: Prepare Your Apache Hadoop Cluster for PySpark Jobs
- StackOverflow: Installing Modules for SPARK on worker nodes
I wasn't able to get the PYSPARK_PYTHON environment variable properly set by exporting it in spark-env.sh as detailed in the Cloudera post above, so I instead set it by adding an export to root's .bashrc. From all I can tell, this is correctly set for all shells, except for remote job submission using the gcloud API from the command line (see below).
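For reference, here's roughly what the two approaches look like. The interpreter path matches the miniconda prefix in the output below; the spark-env.sh location is my assumption and may vary by image version.

# Approach from the Cloudera post (didn't take effect for me on Dataproc):
# append the export to Spark's environment script, e.g. /etc/spark/conf/spark-env.sh
export PYSPARK_PYTHON=/usr/local/bin/miniconda/bin/python

# What I did instead: append the same export to root's .bashrc
echo 'export PYSPARK_PYTHON=/usr/local/bin/miniconda/bin/python' >> /root/.bashrc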
For testing, besides getting the cluster to launch, the Spark shell to run, and a few examples to execute, we need to ensure that the worker nodes (executors) reference the correct (conda) Python distro. I do that by submitting a simple PySpark job that prints the distinct paths to the Python executable found across the partitions of an RDD.
import sys

import numpy as np
import pyspark

sc = pyspark.SparkContext()

# Distribute some dummy data across the cluster's executors.
data = np.random.randn(int(10e3))
dist_data = sc.parallelize(data)

# Collect the distinct Python interpreter paths seen on the executors.
python_execs = dist_data.map(lambda x: sys.executable).distinct().collect()
print(python_execs)
Calling from the root account on the master node seems to work fine:
> spark-submit get-sys-exec.py
...
['/usr/local/bin/miniconda/bin/python']
...
However, when submitting a job remotely using the Dataproc API, it references the default Python distribution:
> gcloud beta dataproc jobs submit pyspark --cluster $DATAPROC_CLUSTER_NAME get-sys-exec.py
...
['/usr/bin/python']
...
Any ideas as to why?
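One workaround I'm considering (untested, so treat it as a sketch): pass the interpreter to the executors explicitly at submit time via Spark's spark.executorEnv.* mechanism, so the job doesn't depend on .bashrc being sourced:

> gcloud beta dataproc jobs submit pyspark --cluster $DATAPROC_CLUSTER_NAME --properties spark.executorEnv.PYSPARK_PYTHON=/usr/local/bin/miniconda/bin/python get-sys-exec.py

That said, it would still be good to understand why the .bashrc export isn't picked up for remotely submitted jobs in the first place.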
Finally, sorry about the noisy commit history. Without an automated test service, I had to develop incrementally, testing by manually launching and deleting Dataproc clusters. When we get to the point where we're ready to merge, I can squash the commits into something more reasonable. :)
Thanks again, and I look forward to your reply. 😄
Hey @nehalecky
Sorry for the lag on this thread. We're getting back up to 100% after the US holidays. We will take a peek at this starting this week and provide some feedback. I just wanted to give you an update so you know why we've been so quiet and what we're going to do.
Cheers!
James
Hey James (@evilsoapbox), thanks for the update! Sounds good, and I look forward to the feedback. Let me know how I can help. :)
closed via: #18