jadianes / spark-py-notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Home Page: http://jadianes.github.io/spark-py-notebooks

License: Other

Languages: Jupyter Notebook 100.00%
Topics: spark, python, pyspark, data-analysis, mllib, ipython-notebook, notebook, ipython, data-science, machine-learning

spark-py-notebooks's Issues

Website isn't working

Thanks for the tutorials!
The website's custom domain has probably expired, and the .github.io link is still redirecting to that domain.

Possible solutions:

  1. Renew the domain subscription
  2. Remove the alias or CNAME record (or the repo's custom domain setting) so the GitHub Pages site no longer redirects to the custom domain

urllib module in nb1-rdd-creation

For Python 3.x users, the urllib module has been split into several submodules, so the import should be from urllib.request import urlretrieve rather than the Python 2 import urllib.
It might be worth updating the notebook accordingly, if you think it is needed.
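
For example, the download cell could be adapted roughly like this (a minimal sketch; the URL assumes the KDD Cup 99 file the first notebook downloads):

# Python 2 (as in the notebook):
# import urllib
# f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")

# Python 3 equivalent:
from urllib.request import urlretrieve
f = urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")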

Apparent Memory Issues

juyptererror.txt
commandprompt.txt
commandprompterror.txt

Hi - I am a student attempting to learn how to use PySpark and Jupyter to build classification models for large data. I installed PySpark v2.2.1 and Jupyter following Michael Galarnyk's tutorial on Medium. It seemed to install OK and I was able to run your first notebook. However, in the second notebook (nb2-rdd-basics) I ran into problems with this code:

from time import time
t0 = time()
# take() pulls the first 100,000 parsed rows back to the driver
head_rows = csv_data.take(100000)
tt = time() - t0
print("Parse completed in {} seconds".format(round(tt, 3)))
Thinking it was a memory issue, I then launched Jupyter with the command
pyspark --master local[4] --driver-memory 32g --executor-memory 32g
I have attached the Jupyter error and the command prompt output from before and after the error.
Please help: how do I increase the memory available to the kernel?
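
One option (a minimal sketch, assuming the notebook runs a plain Python kernel with findspark rather than the pyspark launcher, so the JVM has not been started yet) is to set the driver memory on the SparkConf before the context is created; the 8g value is only an example and must fit within the machine's physical RAM:

import findspark  # assumption: findspark is installed and SPARK_HOME is set
findspark.init()

from pyspark import SparkConf, SparkContext

# Driver memory must be set BEFORE the SparkContext (and its JVM) is created;
# changing it on a running context has no effect.
conf = SparkConf().setMaster("local[4]").set("spark.driver.memory", "8g")
sc = SparkContext(conf=conf)

Also note that take(100000) copies 100,000 parsed rows back to the driver, so taking fewer rows is another way to reduce driver memory pressure.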

[bug] About nb10-sql-dataframes.ipynb (DF.map→RDD.map)

@jadianes
Hello, I'm Hiroyuki.
Nice tutorial, thank you!

In [7]:

tcp_interactions_out = tcp_interactions.map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
  print(ti_out)

However, map is only available on RDDs, so I think we need to convert tcp_interactions (a DataFrame) to an RDD first.

Here is a sample:

tcp_interactions_out = tcp_interactions.rdd.map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
  print(ti_out)

What do you think about it?

If there is a mistake in my code or my wording, I apologize (I'm not very good at writing English).
Please forgive me if I've said anything the wrong way.

spark context

I had an issue with this command line:
$ MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="1G" IPYTHON_OPTS="notebook --pylab inline" /home/philippe/Downloads/spark-master/bin/pyspark

The error was: Connection refused: /127.0.0.1:7077

It was resolved with:
$ MASTER=local[4] SPARK_EXECUTOR_MEMORY="1G" IPYTHON_OPTS="notebook --pylab inline" /home/philippe/Downloads/spark-master/bin/pyspark
Presumably the standalone master at spark://127.0.0.1:7077 was not running, so falling back to local[4] works. Maybe you could say a word about this in the README.

Otherwise, great notebooks and a great help. Thank you!

Logistic Regression with LBFGS in Spark 1.6 and 2.1

@jadianes Nice tutorial on logistic regression, thank you.
I ran the tutorial on Spark 1.6.2 and 2.1.0. Both ran fine, and I could reproduce your results exactly on 1.6.2, but I would like to offer the following observation about 2.1.0: there the process takes about three times longer to run and produces a different answer than 1.6.2. I thought this was strange, and in the list of Spark tasks I found that 2.1.0 was calling a non-LBFGS algorithm. I raised this in a JIRA question (https://issues.apache.org/jira/browse/SPARK-16768). It seems that even though a user can import the LBFGS version into PySpark, call help on it, and actually call it, I don't think it really runs L-BFGS.
http://spark.apache.org/docs/latest/mllib-optimization.html has some more information on LBFGS in Spark.
Later, when 2.1.0 becomes the standard, your readers may find that they cannot reproduce your accuracy results. Or maybe I just missed something; can anyone confirm my observations?
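
For context, the call in question is the MLlib L-BFGS trainer. A minimal, self-contained sketch of that API (with toy data rather than the tutorial's KDD dataset, and assuming sc is the notebook's SparkContext) looks roughly like this:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

# Toy training data, just to show the API shape.
training = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])

# This is the entry point whose underlying optimizer differs between 1.6 and 2.1 in my runs.
model = LogisticRegressionWithLBFGS.train(training, iterations=10)
print(model.predict([1.0, 0.0]))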

license?

What is the license for this repo? Apache 2.0 would be nice :)
