Big Data Computing (2020-2021)

News | General Information | Syllabus | Environment Setup | Class Schedules | Previous Years

News

  • February 2021 Exam Session: Final Grades
    Final grades are available at this link
  • February 2021 Exam Session: Project Presentation Schedule
    Presentations of the 3 projects that have been accepted for oral discussion will take place remotely via Google Meet in a one-day session on February 10 at 9:00AM CET.
  • February 2021 Exam Session
    Registration for the February 2021 exam session is now open on Infostud (id 752692) and will remain open until February 7, 2021. The project submission week opens on February 1, 2021 at 00:00 CET (Central European Time) and closes on February 7, 2021 at 23:59 CET.
    (Please, see the announcement below for additional details on how to submit your project during this session, which is the first one of the academic year 2020-21.)
  • Students who plan to submit their project after the January 2021 session should refer to the Big Data Computing 2020-21 Moodle page, rather than the current one (i.e., Big Data Computing 2019-20). This aligns exam sessions with the correct academic year, since academic year 2019-20 formally ends on January 31, 2021. As such, from February 2021 until January 2022 all exam sessions will be displayed on the newly created Moodle page indicated above, where students will be able to submit their work during the corresponding project submission week, which will be opened along the way as usual.
    (NOTE: Only students who expect to complete the exam in one of the upcoming 2020-21 sessions must subscribe to the Big Data Computing 2020-21 Moodle page!)

General Information

Welcome to the Big Data Computing class!

This is a first-year, second-semester course of the MSc in Computer Science of Sapienza University of Rome.

This repository contains class material along with any useful information for the 2020-2021 academic year.

Class Schedule

  • Tuesday from 5:00PM to 7:00PM
  • Wednesday from 4:00PM to 7:00PM

How to Attend Classes

According to the guidelines issued by Sapienza University to counter the COVID-19 pandemic, the course will be held both in person and remotely. For any further information, students must refer to the official documentation available on the Sapienza website.

Attending Classes in Person: Room G50 - Building G, viale Regina Elena 295

Students who wish to attend classes in person must submit their request through the Prodigit Sapienza online booking system, according to the established rules (please, see here). Once the booking is confirmed - in accordance with the class schedule above - students must go to Room G50, which is located on the 3rd floor of Building G in viale Regina Elena 295.

Attending Classes Remotely: Zoom

Students who wish to attend classes remotely will need to register for the dedicated Zoom conference, using the following link: https://uniroma1.zoom.us/meeting/register/tZUtd-mupz8rGt3uK2Mz_cKmOGDyVQpNmMfm

Class Schedule

  • Tuesday from 5:00PM to 7:00PM (Room G50, 3rd Floor, Building G, viale Regina Elena 295)
  • Wednesday from 4:00PM to 7:00PM (Room G50, 3rd Floor, Building G, viale Regina Elena 295)

Office Hours

  • Tuesday from 2:00PM to 4:00PM, Room G39, located on the 2nd floor of Building G in viale Regina Elena 295.
    (NOTE: Due to the COVID-19 emergency, office hours will be held exclusively online via Google Meet or Zoom, upon email request sent to the following address: [email protected])

Contacts

Moodle Web Page

Students must subscribe to the Moodle web page at the following link, using the same credentials (username/password) they use to access the Wi-Fi network and Infostud services: https://elearning.uniroma1.it/course/view.php?id=12771

Description and Goals

The amount, variety, and rate at which data is being generated nowadays, both by humans and machines, are unprecedented. This opens up a number of challenges in how to deal with such data, as traditional computing paradigms were not conceived to operate at this scale.

"Big Data" is the umbrella term that has rapidly become popular to describe methodologies and tools specifically designed for collecting, storing, and processing very large or complex data sets. In addition to addressing foundational computer science problems, such as searching and sorting, big data computing mainly focuses on extracting knowledge - thereby value - from large-scale data sets using advanced data analysis techniques, such as machine learning.

This course is intended to provide graduate-level students with a deep understanding of programming models and tools that are suitable for the large-scale analysis of data distributed across clusters of computers. More specifically, the course will give students the ability to proficiently develop big data/machine learning solutions on top of industry standard frameworks, such as Hadoop and Spark, to tackle real-world problems faced by the so-called "Big Five" tech companies (i.e., Apple, Amazon, Google, Microsoft, and Facebook): text/graph analysis, classification/regression, and recommendation, just to name a few.

Prerequisites

The course assumes that students are familiar with the basics of data analysis and machine learning, properly supported by a strong knowledge of foundational concepts of calculus, linear algebra, and probability and statistics. In addition, students must have non-trivial computer programming skills (preferably using the Python programming language). Previous experience with Hadoop, Spark, or distributed computing is not required.

Exams

Students must prove their level of comprehension of the subject by developing a software project, leveraging the set of methodologies and tools introduced during classes. Projects must of course refer to typical Big Data tasks, e.g., clustering, prediction, or recommendation on very large datasets, in any application domain of interest. In any case, the topic of the project must be agreed upon with the professor in advance; references from which to select interesting projects (e.g., Kaggle) will be suggested throughout the course. Projects can be done either individually or in groups of at most 2 students, and they should be accompanied by a brief presentation written in English (e.g., a few PowerPoint slides). Finally, there will be an oral exam where submitted projects will be discussed in English; other questions on any topic addressed during the course may also be asked, and those can be answered either in English or in Italian, as the student prefers.

Recommended Textbooks

No textbooks are mandatory to successfully follow this course. However, there is a huge set of references which may be worth mentioning, especially for those who want to dig deeper into some specific topics. Among those, some readings I would like to suggest are the following:

  • Mining of Massive Datasets [Leskovec, Rajaraman, Ullman] available online.
  • Big Data Analysis with Python [Marin, Shukla, VK]
  • Large Scale Machine Learning with Python [Sjardin, Massaron, Boschetti]
  • Spark: The Definitive Guide [Chambers, Zaharia]
  • Learning Spark: Lightning-Fast Big Data Analysis [Karau, Konwinski, Wendell, Zaharia]
  • Hadoop: The Definitive Guide [White]
  • Python for Data Analysis [Mckinney]

Syllabus

[Tentative]

Introduction

  • The Big Data Phenomenon
  • The Big Data Infrastructure
    • Distributed File Systems (HDFS)
    • MapReduce (Hadoop)
    • Spark
  • PySpark + Databricks
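
To give a first taste of the MapReduce/Spark programming model covered in this part, here is a minimal word-count sketch in PySpark (the input path and session setup below are illustrative assumptions, not class material):

```python
from pyspark.sql import SparkSession

# Illustrative only: create (or reuse) a Spark session and grab its SparkContext.
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# MapReduce-style word count expressed as Spark RDD transformations.
lines = sc.textFile("data/sample.txt")                 # hypothetical input file
counts = (lines.flatMap(lambda line: line.split())     # map: split each line into words
               .map(lambda word: (word, 1))            # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))       # reduce: sum the counts per word

print(counts.take(10))
```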

Unsupervised Learning: Clustering

  • Similarity Measures
  • Algorithms: K-means
  • Example: Document Clustering
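
As a rough idea of how the document clustering example maps onto Spark MLlib, here is a minimal K-means sketch on TF-IDF features (the toy corpus and column names are assumptions for illustration, not the actual class material):

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

# Toy corpus; in class the example uses a real document collection.
docs = spark.createDataFrame(
    [(0, "big data computing"), (1, "spark and hadoop"), (2, "machine learning with spark")],
    ["id", "text"],
)

# TF-IDF features, then K-means clustering.
tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 10).transform(tokens)
tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)

model = KMeans(k=2, seed=42, featuresCol="features").fit(tfidf)
model.transform(tfidf).select("id", "prediction").show()
```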

Dimensionality Reduction

  • Feature Extraction
  • Algorithms: Principal Component Analysis (PCA)
  • Example: PCA + Handwritten Digit Recognition
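
A minimal PCA sketch with Spark MLlib is shown below; the toy vectors stand in for the pixel features of handwritten digits used in the class example:

```python
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pca-sketch").getOrCreate()

# Toy 5-dimensional feature vectors; the class example uses pixel features of digit images.
data = spark.createDataFrame(
    [(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
     (Vectors.dense([6.0, 1.0, 3.0, 8.0, 9.0]),)],
    ["features"],
)

# Project onto the top-2 principal components.
pca = PCA(k=2, inputCol="features", outputCol="pca_features").fit(data)
pca.transform(data).select("pca_features").show(truncate=False)
```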

Supervised Learning

  • Basics of Machine Learning
  • Regression/Classification
  • Algorithms: Linear Regression/Logistic Regression/Random Forest
  • Examples:
    • Linear Regression -> House Pricing Prediction (i.e., predict the price at which a house will be sold)
    • Logistic Regression/Random Forest -> Marketing Campaign Prediction (i.e., predict whether a customer will subscribe to a bank's term deposit)
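
As a hedged sketch of how such examples map onto Spark MLlib, here is a minimal logistic-regression pipeline on toy tabular data (the feature names and values are made up for illustration):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("classification-sketch").getOrCreate()

# Toy data: two numeric features and a binary label (e.g., subscribed / did not subscribe).
df = spark.createDataFrame(
    [(35.0, 1200.0, 0.0), (52.0, 300.0, 1.0), (41.0, 870.0, 0.0), (60.0, 150.0, 1.0)],
    ["age", "balance", "label"],
)

# Assemble the raw columns into the single feature vector that MLlib estimators expect.
data = VectorAssembler(inputCols=["age", "balance"], outputCol="features").transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20).fit(data)
model.transform(data).select("label", "prediction", "probability").show(truncate=False)
```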

Recommender Systems

  • Content-based vs. Collaborative filtering
  • Algorithms: k-NN, Matrix Factorization (MF)
  • Example: Movie Recommender System (MovieLens)
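
For matrix factorization, Spark MLlib provides ALS; a minimal sketch on toy ratings might look like the following (in class, the ratings come from the MovieLens dataset):

```python
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("als-sketch").getOrCreate()

# Toy (user, movie, rating) triples; the class example uses MovieLens data instead.
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0), (2, 0, 1.0), (2, 2, 4.0)],
    ["userId", "movieId", "rating"],
)

# Alternating Least Squares matrix factorization.
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=5, maxIter=5, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)

# Top-2 movie recommendations for every user.
model.recommendForAllUsers(2).show(truncate=False)
```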

Graph Analysis

  • Link Analysis
  • Algorithms: PageRank
  • Example: Ranking (a sample of) the Google Web Graph
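
A compact PageRank sketch using plain Spark RDD operations is shown below (the toy edge list is an illustrative assumption; in class the algorithm is run on a sample of the Google web graph):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pagerank-sketch").getOrCreate()
sc = spark.sparkContext

# Toy directed graph as (source, destination) edges.
edges = sc.parallelize([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
links = edges.groupByKey().cache()          # node -> iterable of outgoing neighbors
ranks = links.mapValues(lambda _: 1.0)      # initial rank of 1.0 for every node

for _ in range(10):  # a few power-iteration steps
    # Each node sends rank / out-degree to each of its neighbors.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]]
    )
    # Damping factor of 0.85, as in the classic PageRank formulation.
    ranks = contribs.reduceByKey(lambda a, b: a + b).mapValues(lambda r: 0.15 + 0.85 * r)

print(sorted(ranks.collect(), key=lambda kv: -kv[1]))
```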

Real-time Analytics

  • Streaming Data Processing
  • Example: Twitter Hate Speech Detector
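
Spark Structured Streaming is the kind of API used for such real-time pipelines; here is a minimal sketch that counts words arriving on a local socket (the socket source and port are illustrative assumptions, not the actual Twitter setup used in class):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Read a stream of text lines from a local socket (e.g., one started with `nc -lk 9999`).
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Running word counts over the stream; a real pipeline would plug in a trained classifier instead.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()
```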

Environment Setup

PySpark + Databricks [TBC]

In this course, we will be using the Python application programming interface to the Apache Spark framework (a.k.a. PySpark), in combination with Databricks. This will allow you to write and execute PySpark code (as well as pure Python code, for that matter) in your browser, with:

  • Zero configuration required;
  • Free access to Databricks' powerful cloud infrastructure (including GPUs);
  • Easy sharing.
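
As a small taste of what this looks like in practice, a first cell of a Databricks notebook might be the sketch below (in Databricks, the spark session and the display helper are predefined; the computation itself is just an illustrative toy, not class material):

```python
# In a Databricks notebook, `spark` (a SparkSession) is already defined: no setup code is needed.
df = spark.range(1000).toDF("number")            # a tiny DataFrame with the numbers 0..999
print(df.filter(df.number % 2 == 0).count())     # count the even numbers -> 500
display(df.limit(5))                             # Databricks' built-in rich table rendering
```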

Why Databricks?

Starting from this year, our Big Data Computing class at Sapienza has joined the Databricks University Alliance. This is a very active community of educators and faculty members who collaboratively share ideas, thoughts, and actual material on how to improve the teaching of data science classes, ultimately allowing students to learn the latest data science tools used in industry.

Where Should I Start with Databricks?

The first thing you have to do in order to start using Databricks is to set up a personal account. Databricks accounts come in two flavours:

  • Full Platform (paid, with a 14-day trial)
  • Community Edition (free)

The former is the standard paid account, which gives you access to the fully-fledged Databricks data analytics platform, based on either Microsoft Azure or Amazon AWS computational resources. The latter, instead, allows you to enjoy Databricks on Amazon AWS for free (of course, with some limitations!).

For the purposes of our class, all students must sign up for a personal Databricks Community Edition account using this link. Please be sure to select the correct type of account, as highlighted in the snapshot below:

[Snapshot: Databricks account sign-up page]

For any further information, please follow the instructions provided in the documentation.

What Databricks Resources Should I Use?

Many big companies have started relying on the Databricks platform for running their data analytics tasks. As such, Databricks is really well documented and provides you with a lot of useful material to consult, starting from the official Databricks documentation and tutorials.

Optionally, you may also want to install PySpark on your own local machine.

(NOTE: This step is not required for passing this class)

Local Mode Setup [Optional]

In case you would like to install and configure PySpark on your local machine as well, please follow the instructions described here. Note that those guidelines may refer to older (or, even worse, deprecated) versions of the required installation packages; please see the official PySpark documentation for the most up-to-date installation instructions.
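
As a rough reference, a typical local setup boils down to installing the pyspark package (e.g., pip install pyspark, with a recent Java runtime available) and checking that a local session starts; the smoke test below is a sketch, not part of the official guidelines:

```python
from pyspark.sql import SparkSession

# local[*] runs Spark on all cores of this machine, with no cluster involved.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-smoke-test")
         .getOrCreate())

print(spark.version)               # should print the installed Spark version
print(spark.range(5).collect())    # tiny job that verifies executors actually run
spark.stop()
```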


Class Schedules

Lecture # | Date | Topic | Material

Previous Years

In the following, you can quickly navigate through Big Data Computing class information and material from previous years.

NOTE: The folder containing the class material is the same one used across years and is subject to changes and/or updates; as such, there may be differences between the content displayed on this website and what has been shown in class in the past.
