jadianes / spark-py-notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Home Page: http://jadianes.github.io/spark-py-notebooks

License: Other

Languages: Jupyter Notebook 100.00%
Topics: spark, python, pyspark, data-analysis, mllib, ipython-notebook, notebook, ipython, data-science, machine-learning

spark-py-notebooks's Issues

Website isn't working

Thanks for the tutorials!
The website's custom domain has probably expired, and the .github.io link is still redirecting to that domain.

Possible solutions:

  1. Renew the domain subscription
  2. Remove the alias or CNAME record (or the repo's custom domain setting) so the GitHub Pages site no longer redirects to the custom domain

urllib module in nb1-rdd-creation

For Python 3.x users, the urllib module has been split into several submodules, so the import should be from urllib.request import urlretrieve rather than the Python 2 import urllib.
It might be worth updating the notebook accordingly, if you think it is needed.
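
For example, the download cell could be adapted roughly like this (a minimal sketch; the URL assumes the KDD Cup 99 file the first notebook downloads):

# Python 2 (as in the notebook):
# import urllib
# f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")

# Python 3 equivalent:
from urllib.request import urlretrieve
f = urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")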

Apparent Memory Issues

juyptererror.txt
commandprompt.txt
commandprompterror.txt

Hi - I am a student attempting to learn how to use PySpark and Jupyter to build classification models for large data. I installed PySpark v2.2.1 and Jupyter following Michael Galarnyk's tutorial on Medium. It seemed to install OK and I was able to run your first notebook. However, in the second notebook (nb2-rdd-basics) I ran into problems with this code:

from time import time
t0 = time()
# take() pulls the first 100,000 parsed rows back to the driver
head_rows = csv_data.take(100000)
tt = time() - t0
print("Parse completed in {} seconds".format(round(tt, 3)))
Thinking it was a memory issue, I then launched Jupyter with the command
pyspark --master local[4] --driver-memory 32g --executor-memory 32g
I have attached the Jupyter error and the command prompt output from before and after the error.
Please help: how do I increase the memory available to the kernel?
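
One option (a minimal sketch, assuming the notebook runs a plain Python kernel with findspark rather than the pyspark launcher, so the JVM has not been started yet) is to set the driver memory on the SparkConf before the context is created; the 8g value is only an example and must fit within the machine's physical RAM:

import findspark  # assumption: findspark is installed and SPARK_HOME is set
findspark.init()

from pyspark import SparkConf, SparkContext

# Driver memory must be set BEFORE the SparkContext (and its JVM) is created;
# changing it on a running context has no effect.
conf = SparkConf().setMaster("local[4]").set("spark.driver.memory", "8g")
sc = SparkContext(conf=conf)

Also note that take(100000) copies 100,000 parsed rows back to the driver, so taking fewer rows is another way to reduce driver memory pressure.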

[bug] About nb10-sql-dataframes.ipynb (DF.map→RDD.map)

@jadianes
Hello, I'm Hiroyuki.
Nice tutorial, thank you!

In [7]:

tcp_interactions_out = tcp_interactions.map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
  print(ti_out)

However, map is only available on RDDs, so I think we need to convert tcp_interactions (a DataFrame) to an RDD first.

Here is a sample:

tcp_interactions_out = tcp_interactions.rdd.map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
  print(ti_out)

What do you think about it?

If there is a mistake in my code or my wording, I apologize (I'm not very good at writing English).
Please forgive me if I've said anything the wrong way.

spark context

I had an issue with this command line:
$ MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="1G" IPYTHON_OPTS="notebook --pylab inline" /home/philippe/Downloads/spark-master/bin/pyspark

The error was: Connection refused: /127.0.0.1:7077

It was resolved with:
$ MASTER=local[4] SPARK_EXECUTOR_MEMORY="1G" IPYTHON_OPTS="notebook --pylab inline" /home/philippe/Downloads/spark-master/bin/pyspark
Presumably the standalone master at spark://127.0.0.1:7077 was not running, so falling back to local[4] works. Maybe you could say a word about this in the README.

Otherwise, great notebooks and a great help. Thank you!

Logistic Regression with LBFGS in Spark 1.6 and 2.1

@jadianes Nice tutorial on logistic regression, thank you.
I ran the tutorial on Spark 1.6.2 and 2.1.0. Both ran fine, and I could reproduce your results exactly on 1.6.2, but I would like to offer the following observation about 2.1.0: there the process takes about three times longer to run and produces a different answer than 1.6.2. I thought this was strange, and in the list of Spark tasks I found that 2.1.0 was calling a non-LBFGS algorithm. I raised this in a JIRA question (https://issues.apache.org/jira/browse/SPARK-16768). It seems that even though a user can import the LBFGS version into PySpark, call help on it, and actually call it, I don't think it really runs L-BFGS.
http://spark.apache.org/docs/latest/mllib-optimization.html has some more information on LBFGS in Spark.
Later, when 2.1.0 becomes the standard, your readers may find that they cannot reproduce your accuracy results. Or maybe I just missed something; can anyone confirm my observations?
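
For context, the call in question is the MLlib L-BFGS trainer. A minimal, self-contained sketch of that API (with toy data rather than the tutorial's KDD dataset, and assuming sc is the notebook's SparkContext) looks roughly like this:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

# Toy training data, just to show the API shape.
training = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])

# This is the entry point whose underlying optimizer differs between 1.6 and 2.1 in my runs.
model = LogisticRegressionWithLBFGS.train(training, iterations=10)
print(model.predict([1.0, 0.0]))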

license?

What is the license for this repo? Apache 2.0 would be nice :)
