
Comments (6)

chimerasaurus commented on June 28, 2024

Interesting idea. I will take a look at this over the weekend. My initial thinking is that I'd want to make sure it plays nicely with PySpark, or at least figure out how it does (or doesn't).


nehalecky commented on June 28, 2024

@evilsoapbox, thanks for the reply. I've actually put together a few shell scripts to start this endeavor. If you're interested, I'll submit a PR for review. Your help on a few questions I have would be greatly appreciated. :)


nehalecky commented on June 28, 2024

Hey @evilsoapbox, Happy New Year! 🎉

So, this PR is close to complete from a functionality perspective, but it needs a bit of review and feedback from someone who knows a bit more about Dataproc than I do. :)

Regarding the need to make sure it plays nicely with PySpark, I followed a few posts on how others have managed to set up PySpark + conda, here:

I wasn't able to get the PYSPARK_PYTHON environment variable to be set properly by exporting it in spark-env.sh as detailed in the Cloudera post above, so instead I set it by adding an export to root's .bashrc. From all I can tell, it is now set correctly for all shells, except for remote job submission via the gcloud API from the command line (see below).
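
For reference, the export I ended up adding is just this one line; the interpreter path assumes Miniconda is installed to /usr/local/bin/miniconda, matching the output further down:

# appended to /root/.bashrc (exporting the same line in spark-env.sh did not take effect for me)
export PYSPARK_PYTHON=/usr/local/bin/miniconda/bin/python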

For testing, besides getting the cluster to launch, the Spark shell to run, and a few examples to execute, we need to ensure that the worker nodes (executors) reference the correct (conda) Python distribution. I do that by submitting a simple PySpark job that prints the distinct paths to the Python executable found across the partitions of an RDD.

# get-sys-exec.py: report which Python executables the Spark executors are using.
import sys

import numpy as np
import pyspark

sc = pyspark.SparkContext()

# Parallelize some throwaway data, record sys.executable inside each task,
# and collapse the results with distinct() to see every interpreter path in use.
data = np.random.randn(int(10e3))
distData = sc.parallelize(data)
python_execs = distData.map(lambda x: sys.executable).distinct().collect()

print(python_execs)

Calling it from the root account on the master node seems to work fine:

> spark-submit get-sys-exec.py
...
['/usr/local/bin/miniconda/bin/python']
...

However, when submitting the job remotely using the Dataproc API, it references the default Python distribution:

> gcloud beta dataproc jobs submit pyspark --cluster $DATAPROC_CLUSTER_NAME get-sys-exec.py
...
['/usr/bin/python']
...

Any ideas as to why?
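
One workaround I haven't verified yet: rather than relying on the shell environment, push the interpreter path to the driver and executors explicitly via Spark properties at submit time, something along these lines (assuming the beta command accepts --properties and that the YARN appMasterEnv/executorEnv properties behave as documented):

# untested sketch: set PYSPARK_PYTHON explicitly for the YARN app master and the executors
gcloud beta dataproc jobs submit pyspark --cluster $DATAPROC_CLUSTER_NAME get-sys-exec.py \
    --properties spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/local/bin/miniconda/bin/python,spark.executorEnv.PYSPARK_PYTHON=/usr/local/bin/miniconda/bin/python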

Finally, sorry about the noisy commit history. Without an automated test service, I had to develop incrementally and test by manually launching and deleting Dataproc clusters. When we're ready to merge, I can squash the commits into something more reasonable. :)

Thanks again, and look forward to your reply. 😄


chimerasaurus commented on June 28, 2024

Hey @nehalecky

Sorry for the lag on this thread. We're getting back up to 100% after the US holidays. We will take a peek at this starting this week and provide some feedback. I just wanted to give you an update so you know why we've been so quiet and what we're going to do.

Cheers!

James


nehalecky commented on June 28, 2024

Hey James (@evilsoapbox), thanks for the update! Sounds good and look forward to the feedback. Let me know how I can help. :)


nehalecky commented on June 28, 2024

Closed via #18.

