
edx-spark-cs100.1x's Introduction

Introduction to Big Data with Apache Spark

Here I will put all my notes about this online course : https://courses.edx.org/courses/BerkeleyX/CS100.1x/1T2015

My goal is also to learn about #python.

Note : no solution will be posted here :)

Week1

Just the installation needed to start the real course. The installation is based on #vagrant and #virtualbox.

Week2

Basics of Spark and RDDs.

Spark program = DRIVER + WORKERS

RDDs are distributed across workers.
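A minimal sketch of the driver/worker split, assuming a local PySpark install. The names `square` and `run_on_spark`, the `local[2]` master and the app name are illustrative choices, not taken from the course.

```python
def square(x):
    """Runs on the workers, once per element of the RDD."""
    return x * x


def run_on_spark(numbers, num_partitions=4):
    """Driver code: creates the context, distributes the data, collects the result."""
    # Imported lazily so that square() can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(master="local[2]", appName="driver-and-workers")
    try:
        # The driver splits the list into partitions that live on the workers.
        rdd = sc.parallelize(numbers, num_partitions)
        # map() ships square() to the workers; collect() brings results back to the driver.
        return rdd.getNumPartitions(), rdd.map(square).collect()
    finally:
        sc.stop()
```

Calling `run_on_spark(range(10))` starts a local context with two worker threads, which is enough to see the split between driver code and worker code.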

Transformations : map, filter, distinct, flatMap
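The four transformations listed above can be chained as in the sketch below. It assumes a local PySpark install; the helper names (`split_words`, `is_long`, `long_distinct_words`) and the sample sentences are made up for illustration.

```python
def split_words(line):
    """One line in, several words out: the kind of function flatMap expects."""
    return line.lower().split()


def is_long(word):
    """Predicate for filter: keep words longer than three characters."""
    return len(word) > 3


def long_distinct_words(lines_rdd):
    """Chain the four transformations. Nothing runs yet: transformations are lazy."""
    return (lines_rdd
            .flatMap(split_words)        # each line becomes many words
            .filter(is_long)             # drop short words
            .map(lambda w: w.upper())    # one output per input
            .distinct())                 # remove duplicates


def main():
    # Imported lazily so the helpers above can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(master="local[2]", appName="transformations")
    try:
        lines = sc.parallelize(["Spark makes RDDs", "RDDs are distributed"])
        # collect() is the action that finally triggers the computation.
        print(sorted(long_distinct_words(lines).collect()))
    finally:
        sc.stop()
```

Call `main()` to run it. `long_distinct_words` only describes the pipeline; the work happens when `collect()` is called.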

Actions : reduce, take, collect, takeOrdered
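A small sketch of the four actions listed above, assuming a local PySpark install. `descending`, `run_actions` and the sample numbers are hypothetical names and data chosen for the example.

```python
from operator import add


def descending(x):
    """Sort key that makes takeOrdered return the largest values first."""
    return -x


def run_actions(rdd):
    """Each call below is an action: it triggers a job and returns a plain Python value."""
    return {
        "sum": rdd.reduce(add),                             # combine all elements pairwise
        "first_two": rdd.take(2),                           # first elements, no sorting
        "smallest_two": rdd.takeOrdered(2),                 # smallest elements
        "largest_two": rdd.takeOrdered(2, key=descending),  # largest elements
        "all": rdd.collect(),                               # everything, back on the driver
    }


def main():
    # Imported lazily so descending() can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(master="local[2]", appName="actions")
    try:
        print(run_actions(sc.parallelize([5, 3, 8, 1])))
    finally:
        sc.stop()
```

`collect()` copies the whole RDD to the driver, so it is only safe on small results; `take` and `takeOrdered` are the usual way to peek at a large dataset.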

Caching an RDD that you reuse is essential to speed up your Spark program : without it, the RDD is recomputed for every action that uses it.
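A sketch of caching an RDD that feeds two actions, assuming a local PySpark install. `parse_number` stands in for an expensive per-record step; both it and `count_and_sum` are hypothetical names.

```python
def parse_number(line):
    """Stand-in for an expensive per-record step."""
    return int(line.strip())


def count_and_sum(lines_rdd):
    """Run two actions on the same parsed RDD, caching it in between."""
    # cache() only marks the RDD for in-memory storage; it is still lazy.
    numbers = lines_rdd.map(parse_number).cache()
    try:
        count = numbers.count()  # first action: computes the RDD and fills the cache
        total = numbers.sum()    # second action: can reuse the cached partitions
        return count, total
    finally:
        numbers.unpersist()      # free the memory once the RDD is no longer needed


def main():
    # Imported lazily so parse_number() can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(master="local[2]", appName="caching")
    try:
        print(count_and_sum(sc.parallelize(["1", " 2", "3 "])))
    finally:
        sc.stop()
```

Without the `cache()` call, `parse_number` would run over every line twice, once per action.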

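You can apply a transformation to a dataset (lazy, it returns a new RDD) or run an action on it (it triggers the computation and returns a result to the driver).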

Shared Variables, 2 types :

  • Broadcast variables : read-only value sent to all workers

  • Accumulators : aggregate value from workers to driver
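The two kinds of shared variables can be sketched together, assuming a local PySpark install. The lookup table, the `lookup` helper and `label_codes` are hypothetical names invented for the example.

```python
def lookup(table, code):
    """Translate a code with a plain dict, falling back to 'unknown'."""
    return table.get(code, "unknown")


def label_codes(sc, codes, country_names):
    """Label codes using a broadcast table and count the misses with an accumulator."""
    names = sc.broadcast(country_names)  # read-only copy sent once to each worker
    missing = sc.accumulator(0)          # workers add to it, only the driver reads it

    def to_name(code):
        if code not in names.value:
            missing.add(1)
        return lookup(names.value, code)

    labelled = sc.parallelize(codes).map(to_name).collect()
    # The accumulator value is read on the driver, after the action has run.
    return labelled, missing.value


def main():
    # Imported lazily so lookup() can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(master="local[2]", appName="shared-variables")
    try:
        print(label_codes(sc, ["FR", "US", "XX"], {"FR": "France", "US": "United States"}))
    finally:
        sc.stop()
```

One caveat: an accumulator updated inside a transformation such as `map` may be incremented more than once if Spark recomputes a partition, so counts like `missing` are best treated as debugging aids unless they are updated inside an action.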

Links

https://spark.apache.org/docs/latest/programming-guide.html

https://spark.apache.org/docs/latest/api/python/

1 TB free dataset from Criteo : http://apache-spark-user-list.1001560.n3.nabble.com/Dataset-announcement-td22507.html

Week3
