Big Data Computing (2020-2021)

News | General Information | Syllabus | Environment Setup | Class Schedules | Previous Years

News

  • February 2021 Exam Session: Final Grades
    Final grades are available at this link
  • February 2021 Exam Session: Project Presentation Schedule
    Presentations of the 3 projects that have been accepted for oral discussion will take place remotely via Google Meet in a one-day session on February 10 at 9:00AM CET.
  • February 2021 Exam Session
    Registration for the February 2021 exam session is now open on Infostud (id 752692) and will remain open until February 7, 2021. The project submission week opens on February 1, 2021 at 00:00 CET (Central European Time) and closes on February 7, 2021 at 23:59 CET.
    (Please, see the announcement below for additional details on how to submit your project during this session, which is the first one of the academic year 2020-21.)
  • Students who plan to submit their project after the January 2021 session should refer to the Big Data Computing 2020-21 Moodle page, rather than the current one (i.e., Big Data Computing 2019-20). This aligns exam sessions with the correct academic year, since academic year 2019-20 formally ends on January 31, 2021. As such, from February 2021 until January 2022 all exam sessions will be displayed on the newly created Moodle page indicated above, where students will be able to submit their work during the corresponding project submission week, which will be opened along the way as usual.
    (NOTE: Only students who expect to complete the exam in one of the upcoming 2020-21 sessions must subscribe to the Big Data Computing 2020-21 Moodle page!)

General Information

Welcome to the Big Data Computing class!

This is a first-year, second-semester course of the MSc in Computer Science of Sapienza University of Rome.

This repository contains class material along with any useful information for the 2020-2021 academic year.

Class Schedule

  • Tuesday from 5:00PM to 7:00PM
  • Wednesday from 4:00PM to 7:00PM

How to Attend Classes

According to the guidelines issued by Sapienza University to counter the COVID-19 pandemic, the course will be held both in person and remotely. For any further information, students must refer to the official documentation available on the Sapienza website.

Attending Classes in Person: Room G50 - Building G, viale Regina Elena 295

Students who wish to attend classes in person must submit their request through the Prodigit Sapienza online booking system, according to the established rules (please, see here). Once the booking is confirmed - in accordance with the class schedule above - students must go to Room G50, which is located on the 3rd floor of Building G in viale Regina Elena 295.

Attending Classes Remotely: Zoom

Students who wish to attend classes remotely will need to register for the dedicated Zoom conference, using the following link: https://uniroma1.zoom.us/meeting/register/tZUtd-mupz8rGt3uK2Mz_cKmOGDyVQpNmMfm

Class Schedule

  • Tuesday from 5:00PM to 7:00PM (Room G50, 3rd Floor, Building G, viale Regina Elena 295)
  • Wednesday from 4:00PM to 7:00PM (Room G50, 3rd Floor, Building G, viale Regina Elena 295)

Office Hours

  • Tuesday from 2:00PM to 4:00PM, Room G39, located on the 2nd floor of Building G in viale Regina Elena 295.
    (NOTE: Due to the COVID-19 emergency, office hours will be held exclusively online via Google Meet or Zoom, upon email request sent to the following address: [email protected])

Contacts

Moodle Web Page

Students must subscribe to the Moodle web page at the following link, using the same credentials (username/password) they use to access the Wi-Fi network and Infostud services: https://elearning.uniroma1.it/course/view.php?id=12771

Description and Goals

The amount, variety, and rate at which data is being generated nowadays, both by humans and machines, are unprecedented. This opens up a number of challenges in how to deal with such data, as traditional computing paradigms were not conceived to operate at this scale.

"Big Data" is the umbrella term that has rapidly become popular to describe methodologies and tools specifically designed for collecting, storing, and processing very large or complex data sets. In addition to addressing foundational computer science problems, such as searching and sorting, big data computing mainly focuses on extracting knowledge - thereby value - from large-scale data sets using advanced data analysis techniques, such as machine learning.

This course is intended to provide graduate-level students with a deep understanding of programming models and tools that are suitable for the large-scale analysis of data distributed across clusters of computers. More specifically, the course will give students the ability to proficiently develop big data/machine learning solutions on top of industry standard frameworks, such as Hadoop and Spark, to tackle real-world problems faced by the so-called "Big Five" tech companies (i.e., Apple, Amazon, Google, Microsoft, and Facebook): text/graph analysis, classification/regression, and recommendation, just to name a few.

Prerequisites

The course assumes that students are familiar with the basics of data analysis and machine learning, properly supported by a strong knowledge of foundational concepts of calculus, linear algebra, and probability and statistics. In addition, students must have non-trivial computer programming skills (preferably using the Python programming language). Previous experience with Hadoop, Spark, or distributed computing is not required.

Exams

Students must prove their level of comprehension of the subject by developing a software project, leveraging the set of methodologies and tools introduced during classes. Projects must of course refer to typical Big Data tasks, e.g., clustering, prediction, or recommendation on very large datasets, in any application domain of interest. In any case, the topic of the project must be agreed upon with the professor in advance; references from which to select interesting projects (e.g., Kaggle) will be suggested throughout the course. Projects can be done either individually or in groups of at most 2 students, and they should be accompanied by a brief presentation written in English (e.g., a few PowerPoint slides). Finally, there will be an oral exam where submitted projects will be discussed in English; other questions on any topic addressed during the course may also be asked, and those can be answered either in English or in Italian, as the student prefers.

Recommended Textbooks

No textbooks are mandatory to successfully follow this course. However, there is a huge set of references which may be worth mentioning, especially for those who want to dig deeper into some specific topics. Among those, some readings I would like to suggest are the following:

  • Mining of Massive Datasets [Leskovec, Rajaraman, Ullman] available online.
  • Big Data Analysis with Python [Marin, Shukla, VK]
  • Large Scale Machine Learning with Python [Sjardin, Massaron, Boschetti]
  • Spark: The Definitive Guide [Chambers, Zaharia]
  • Learning Spark: Lightning-Fast Big Data Analysis [Karau, Konwinski, Wendell, Zaharia]
  • Hadoop: The Definitive Guide [White]
  • Python for Data Analysis [Mckinney]

Syllabus

[Tentative]

Introduction

  • The Big Data Phenomenon
  • The Big Data Infrastructure
    • Distributed File Systems (HDFS)
    • MapReduce (Hadoop)
    • Spark
  • PySpark + Databricks
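
To give a first taste of the MapReduce/Spark programming model covered in this part, here is a minimal word-count sketch in PySpark (the input path and session setup below are illustrative assumptions, not class material):

```python
from pyspark.sql import SparkSession

# Illustrative only: create (or reuse) a Spark session and grab its SparkContext.
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# MapReduce-style word count expressed as Spark RDD transformations.
lines = sc.textFile("data/sample.txt")                 # hypothetical input file
counts = (lines.flatMap(lambda line: line.split())     # map: split each line into words
               .map(lambda word: (word, 1))            # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))       # reduce: sum the counts per word

print(counts.take(10))
```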

Unsupervised Learning: Clustering

  • Similarity Measures
  • Algorithms: K-means
  • Example: Document Clustering
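
As a rough idea of how the document clustering example maps onto Spark MLlib, here is a minimal K-means sketch on TF-IDF features (the toy corpus and column names are assumptions for illustration, not the actual class material):

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

# Toy corpus; in class the example uses a real document collection.
docs = spark.createDataFrame(
    [(0, "big data computing"), (1, "spark and hadoop"), (2, "machine learning with spark")],
    ["id", "text"],
)

# TF-IDF features, then K-means clustering.
tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 10).transform(tokens)
tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)

model = KMeans(k=2, seed=42, featuresCol="features").fit(tfidf)
model.transform(tfidf).select("id", "prediction").show()
```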

Dimensionality Reduction

  • Feature Extraction
  • Algorithms: Principal Component Analysis (PCA)
  • Example: PCA + Handwritten Digit Recognition
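
A minimal PCA sketch with Spark MLlib is shown below; the toy vectors stand in for the pixel features of handwritten digits used in the class example:

```python
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pca-sketch").getOrCreate()

# Toy 5-dimensional feature vectors; the class example uses pixel features of digit images.
data = spark.createDataFrame(
    [(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
     (Vectors.dense([6.0, 1.0, 3.0, 8.0, 9.0]),)],
    ["features"],
)

# Project onto the top-2 principal components.
pca = PCA(k=2, inputCol="features", outputCol="pca_features").fit(data)
pca.transform(data).select("pca_features").show(truncate=False)
```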

Supervised Learning

  • Basics of Machine Learning
  • Regression/Classification
  • Algorithms: Linear Regression/Logistic Regression/Random Forest
  • Examples:
    • Linear Regression -> House Pricing Prediction (i.e., predict the price at which a house will be sold)
    • Logistic Regression/Random Forest -> Marketing Campaign Prediction (i.e., predict whether a customer will subscribe to a bank's term deposit)
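
As a hedged sketch of how such examples map onto Spark MLlib, here is a minimal logistic-regression pipeline on toy tabular data (the feature names and values are made up for illustration):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("classification-sketch").getOrCreate()

# Toy data: two numeric features and a binary label (e.g., subscribed / did not subscribe).
df = spark.createDataFrame(
    [(35.0, 1200.0, 0.0), (52.0, 300.0, 1.0), (41.0, 870.0, 0.0), (60.0, 150.0, 1.0)],
    ["age", "balance", "label"],
)

# Assemble the raw columns into the single feature vector that MLlib estimators expect.
data = VectorAssembler(inputCols=["age", "balance"], outputCol="features").transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20).fit(data)
model.transform(data).select("label", "prediction", "probability").show(truncate=False)
```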

Recommender Systems

  • Content-based vs. Collaborative filtering
  • Algorithms: k-NN, Matrix Factorization (MF)
  • Example: Movie Recommender System (MovieLens)
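
For matrix factorization, Spark MLlib provides ALS; a minimal sketch on toy ratings might look like the following (in class, the ratings come from the MovieLens dataset):

```python
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("als-sketch").getOrCreate()

# Toy (user, movie, rating) triples; the class example uses MovieLens data instead.
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0), (2, 0, 1.0), (2, 2, 4.0)],
    ["userId", "movieId", "rating"],
)

# Alternating Least Squares matrix factorization.
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=5, maxIter=5, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)

# Top-2 movie recommendations for every user.
model.recommendForAllUsers(2).show(truncate=False)
```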

Graph Analysis

  • Link Analysis
  • Algorithms: PageRank
  • Example: Ranking (a sample of) the Google Web Graph
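
A compact PageRank sketch using plain Spark RDD operations is shown below (the toy edge list is an illustrative assumption; in class the algorithm is run on a sample of the Google web graph):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pagerank-sketch").getOrCreate()
sc = spark.sparkContext

# Toy directed graph as (source, destination) edges.
edges = sc.parallelize([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
links = edges.groupByKey().cache()          # node -> iterable of outgoing neighbors
ranks = links.mapValues(lambda _: 1.0)      # initial rank of 1.0 for every node

for _ in range(10):  # a few power-iteration steps
    # Each node sends rank / out-degree to each of its neighbors.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]]
    )
    # Damping factor of 0.85, as in the classic PageRank formulation.
    ranks = contribs.reduceByKey(lambda a, b: a + b).mapValues(lambda r: 0.15 + 0.85 * r)

print(sorted(ranks.collect(), key=lambda kv: -kv[1]))
```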

Real-time Analytics

  • Streaming Data Processing
  • Example: Twitter Hate Speech Detector
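
Spark Structured Streaming is the kind of API used for such real-time pipelines; here is a minimal sketch that counts words arriving on a local socket (the socket source and port are illustrative assumptions, not the actual Twitter setup used in class):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Read a stream of text lines from a local socket (e.g., one started with `nc -lk 9999`).
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Running word counts over the stream; a real pipeline would plug in a trained classifier instead.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()
```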

Environment Setup

PySpark + Databricks [TBC]

In this course, we will be using the Python application programming interface to the Apache Spark framework (a.k.a. PySpark), in combination with Databricks. This will allow you to write and execute PySpark code (as well as pure Python code, for that matter) in your browser, with:

  • Zero configuration required;
  • Free access to Databricks' powerful cloud infrastructure (including GPUs);
  • Easy sharing.
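
As a small taste of what this looks like in practice, a first cell of a Databricks notebook might be the sketch below (in Databricks, the spark session and the display helper are predefined; the computation itself is just an illustrative toy, not class material):

```python
# In a Databricks notebook, `spark` (a SparkSession) is already defined: no setup code is needed.
df = spark.range(1000).toDF("number")            # a tiny DataFrame with the numbers 0..999
print(df.filter(df.number % 2 == 0).count())     # count the even numbers -> 500
display(df.limit(5))                             # Databricks' built-in rich table rendering
```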

Why Databricks?

Starting from this year, our Big Data Computing class at Sapienza has joined the Databricks University Alliance. This is a very active community of educators and faculty members who collaboratively share ideas, thoughts, and actual material on how to improve the teaching of data science classes, ultimately allowing students to learn the latest data science tools used in industry.

Where Should I Start with Databricks?

The first thing you have to do in order to start using Databricks is to set up a personal account. Databricks accounts come in two flavours:

  • Full Platform (paid, with a 14-day trial)
  • Community Edition (free)

The former is the standard paid account, which gives you access to the fully-fledged Databricks data analytics platform, based on either Microsoft Azure or Amazon AWS computational resources. The latter, instead, allows you to enjoy Databricks on Amazon AWS for free (of course, with some limitations!).

For the purposes of our class, all students must sign up for a personal Databricks Community Edition account using this link. Please be sure to select the correct type of account, as highlighted in the snapshot below:

[Snapshot: Databricks account sign-up page]

For any further information, please follow the instructions provided in the documentation.

What Databricks Resources Should I Use?

Many big companies have started relying on the Databricks platform for running their data analytics tasks. As such, Databricks is really well documented and provides you with a lot of useful material to consult, starting from the official Databricks documentation and tutorials.

Optionally, you may also want to install PySpark on your own local machine.

(NOTE: This step is not required for passing this class)

Local Mode Setup [Optional]

In case you would like to install and configure PySpark on your local machine as well, please follow the instructions described here. Note that those guidelines may refer to older (or, even worse, deprecated) versions of the required installation packages; please see the official PySpark documentation for the most up-to-date installation instructions.
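
As a rough reference, a typical local setup boils down to installing the pyspark package (e.g., pip install pyspark, with a recent Java runtime available) and checking that a local session starts; the smoke test below is a sketch, not part of the official guidelines:

```python
from pyspark.sql import SparkSession

# local[*] runs Spark on all cores of this machine, with no cluster involved.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-smoke-test")
         .getOrCreate())

print(spark.version)               # should print the installed Spark version
print(spark.range(5).collect())    # tiny job that verifies executors actually run
spark.stop()
```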


Class Schedules

Lecture # | Date | Topic | Material

Previous Years

In the following, you can quickly navigate through Big Data Computing class information and material from previous years.

NOTE: The folder containing the class material is the same one used across years and is subject to changes and/or updates; as such, there may be differences between the content displayed on this website and what has been shown in class in the past.
