Giter VIP home page Giter VIP logo

bigdata_homeworks's Introduction

Big Data 2022/23 Homeworks

Homeworks for the Big Data Course 2022/23, Computer Engineering Master's Degree @ Unipd

Organization of the Repository

  • hw1: datasets, output and code for the first homework
  • hw2: datasets, output and code for the second homework
  • hw3: datasets, output and code for the third homework

Homework 1 - Triangle Counting

In the homework you must implement and test in Spark two MapReduce algorithms to count the number of distinct triangles in an undirected graph $G=(V,E)$, where a triangle is defined by 3 vertices $u,v,w\in V$, such that $(u,v),(v,w),(w,u)\in E$.

image

Both algorithms use an integer parameter $C\ge 1$, which is used to partition the data.

ALGORITHM 1:

Define a hash function โ„Ž๐ถ which maps each vertex ๐‘ข in ๐‘‰ into a color $h_C(u)$ in $[0,C-1]$. To this purpose, we advise you to use the hash function $$h_C(u)=(((a\cdot u+b)mod: p)mod C)$$ where $p=8191$ (which is prime), $a$ is a random integer in $[1,p-1]$, and $b$ is a random integer in $[0,p-1]$.

Round 1:
  • Create $C$ subsets of edges, where, for $0\leq i < C$, the i-th subset, $E(i)$ consist of all edges $(u,v)\in E$ such that $h_C(u)=h_C(v)=i$. Note that if the two endpoints of an edge have different colors, the edge does not belong to any $E(i)$ and will be ignored by the algorithm.
  • Compute the number $t(i)$ triangles formed by edges of $E(i)$, separately for each $0\leq i < C$.
Round 2:
  • Compute and return $t_{final}=C^2\sum_{0\leq i < C}t(i)$ as final estimate of the number of triangles in $G$.

In the homework you must develop an implementation of this algorithm as a method/function MR_ApproxTCwithNodeColors.

ALGORITHM 2:

Round 1:
  • Partition the edges at random into $C$ subsets $E(0),E(1),...,E(C-1)$. Note that, unlike the previous algorithm, now every edge ends up in some $E(i).
  • Compute the number $t(i)$ of triangles formed by edges of $E(i)$, separately for each $0\le i < C$.
Round 2:
  • Compute and return $t_{final}=C^2\sum_{0\leq i < C}t(i)$ as final estimate of the number of triangles in G.

In the homework you must develop an implementation of this algorithm as a method/function MR_ApproxTCwithSparkPartitions that, in Round 1, uses the partitions provided by Spark, which you can access through method mapPartitions.

DATA FORMAT.

To implement the algorithms assume that the vertices (set $V$) are represented as 32-bit integers, and that the graph $G$ is given in input as the set of edges $E$ stored in a file. Each row of the file contains one edge stored as two integers (the edge's endpoints) separated by comma (','). Each edge of $E$ appears exactly once in the file and $E$ does not contain multiple copies of the same edge.

Homework 2 - Triangle Counting in CloudVeneto Cluster

In this homework, you will run a Spark program on the CloudVeneto cluster. As for Homework 1, the objective is to estimate (approximately or exactly) the number of triangles in an undirected graph ๐บ=(๐‘‰,๐ธ). More specifically, your program must implement two algorithms:

ALGORITHM 1:

The same as Algorithm 1 in Homework 1.

ALGORITHM 2:

A 2-round MapReduce algorithm which returns the exact number of triangles. The algorithm is based on node colors (as Algorithm 1) and works as follows: Let $C\ge 1$ be the number of colors and let $h_C(.)$ bee the hash function that assigns a color to each node used in Algorithm 1.

Round 1
  • For each edge $(u,v)\in E$ separately create $C$ key-value pairs $(k_i,(u,v))$ with $i=0,1,...,C-1$ where each key $k_i$ is a triplet containing the three colors $h_C(u),h_C(v),i$ sorted in non-decreasing order.
  • For each key $k=(x,y,z)$ let $L_k$ be the list of values (i.e., edges) of intermediate pairs with key $k$. Compute the number $t_k$ of triangles formed by the edges of $L_k$ whose node colors, in sorted order, are $x,y,z$. Note that the edges of $L_k$ may form also triangles whose node colors are not the correct ones: e.g., $(x,y,y)$ with $y\ne z$.
Round 2.

Compute and output the sum of all $t_k$'s determined in Round 1. It is easy to see that every triangle in the graph ๐บ is counted exactly once in the sum. You can assume that the total number of $t_k$'s is small, so that they can be garthered in a local structure. Alternatively, you can use some ready-made reduce method to do the sum. Both approaches are fine.

Homework 3 - Count Sketch with Spark Streaming

You must write a program which receives in input the following 6 command-line arguments (in the given order):

  • An integer D: the number of rows of the count sketch
  • An integer W: the number of columns of the count sketch
  • An integer left: the left endpoint of the interval of interest
  • An integer right: the right endpoint of the interval of interest
  • An integer K: the number of top frequent items of interest
  • An integer portExp: the port number The program must read the first (approximately) 10M items of the stream $\Sigma$ generated from the remote machine at port portExp and compute the following statistics. Let R denote the interval [left,right] and let $\Sigma_R$ be the substream consisting of all items of $\Sigma$ belonging to $R$.

The program must compute:

  • A $D\times W$ count sketch for $\Sigma_R$
  • The exact frequencies of all distinct items of $\Sigma_R$
  • The true second moment $F_2$ of |\Sigma_R|. To avoid large numbers, normalize $F_2$ by dividing it by $|\Sigma_R|^2$.
  • The approximate second moment $\tilde{F}_2$ of $\Sigma_R$ using count sketch, also normalized by dividing it by $|\Sigma_R|^2$.
  • The average relative error of the frequency estimates provided by the count sketch where the average is computed over the items of $u\in \Sigma_R$ whose true frequency is $f_u\ge \phi(K)$, where $\phi(K)$ is the $K$-th largest frequency of the items of $\Sigma_R$. Recall that if $\tilde{f}_u$ is the estimated frequency for $u$, the relative error of is $|f_u-\tilde{f}_u| / f_u$.

bigdata_homeworks's People

Contributors

enricobolzonello avatar francescovisentin avatar albertob13 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.