Homeworks for the Big Data Course 2022/23, Computer Engineering Master's Degree @ Unipd
- Organization of the Repository
- Homework 1 - Triangle Counting
- Homework 2 - Triangle Counting in CloudVeneto Cluster
- Homework 3 - Count Sketch with Spark Streaming
hw1
: datasets, output and code for the first homeworkhw2
: datasets, output and code for the second homeworkhw3
: datasets, output and code for the third homework
In the homework you must implement and test in Spark two MapReduce algorithms to count the number of distinct triangles in an undirected graph
Both algorithms use an integer parameter
Define a hash function โ๐ถ which maps each vertex ๐ข in ๐ into a color
- Create
$C$ subsets of edges, where, for$0\leq i < C$ , the i-th subset,$E(i)$ consist of all edges$(u,v)\in E$ such that$h_C(u)=h_C(v)=i$ . Note that if the two endpoints of an edge have different colors, the edge does not belong to any$E(i)$ and will be ignored by the algorithm. - Compute the number
$t(i)$ triangles formed by edges of$E(i)$ , separately for each$0\leq i < C$ .
- Compute and return
$t_{final}=C^2\sum_{0\leq i < C}t(i)$ as final estimate of the number of triangles in$G$ .
In the homework you must develop an implementation of this algorithm as a method/function MR_ApproxTCwithNodeColors
.
- Partition the edges at random into
$C$ subsets$E(0),E(1),...,E(C-1)$ . Note that, unlike the previous algorithm, now every edge ends up in some $E(i). - Compute the number
$t(i)$ of triangles formed by edges of$E(i)$ , separately for each$0\le i < C$ .
- Compute and return
$t_{final}=C^2\sum_{0\leq i < C}t(i)$ as final estimate of the number of triangles in G.
In the homework you must develop an implementation of this algorithm as a method/function MR_ApproxTCwithSparkPartitions
that, in Round 1, uses the partitions provided by Spark, which you can access through method mapPartitions.
To implement the algorithms assume that the vertices (set
In this homework, you will run a Spark program on the CloudVeneto cluster. As for Homework 1, the objective is to estimate (approximately or exactly) the number of triangles in an undirected graph ๐บ=(๐,๐ธ). More specifically, your program must implement two algorithms:
The same as Algorithm 1 in Homework 1.
A 2-round MapReduce algorithm which returns the exact number of triangles. The algorithm is based on node colors (as Algorithm 1) and works as follows:
Let
- For each edge
$(u,v)\in E$ separately create$C$ key-value pairs$(k_i,(u,v))$ with$i=0,1,...,C-1$ where each key$k_i$ is a triplet containing the three colors$h_C(u),h_C(v),i$ sorted in non-decreasing order. - For each key
$k=(x,y,z)$ let$L_k$ be the list of values (i.e., edges) of intermediate pairs with key$k$ . Compute the number$t_k$ of triangles formed by the edges of$L_k$ whose node colors, in sorted order, are$x,y,z$ . Note that the edges of$L_k$ may form also triangles whose node colors are not the correct ones: e.g.,$(x,y,y)$ with$y\ne z$ .
Compute and output the sum of all
You must write a program which receives in input the following 6 command-line arguments (in the given order):
- An integer D: the number of rows of the count sketch
- An integer W: the number of columns of the count sketch
- An integer left: the left endpoint of the interval of interest
- An integer right: the right endpoint of the interval of interest
- An integer K: the number of top frequent items of interest
- An integer portExp: the port number
The program must read the first (approximately) 10M items of the stream
$\Sigma$ generated from the remote machine at port portExp and compute the following statistics. Let R denote the interval [left,right] and let$\Sigma_R$ be the substream consisting of all items of$\Sigma$ belonging to$R$ .
The program must compute:
- A
$D\times W$ count sketch for$\Sigma_R$ - The exact frequencies of all distinct items of
$\Sigma_R$ - The true second moment
$F_2$ of |\Sigma_R|. To avoid large numbers, normalize$F_2$ by dividing it by$|\Sigma_R|^2$ . - The approximate second moment
$\tilde{F}_2$ of$\Sigma_R$ using count sketch, also normalized by dividing it by$|\Sigma_R|^2$ . - The average relative error of the frequency estimates provided by the count sketch where the average is computed over the items of
$u\in \Sigma_R$ whose true frequency is$f_u\ge \phi(K)$ , where$\phi(K)$ is the$K$ -th largest frequency of the items of$\Sigma_R$ . Recall that if$\tilde{f}_u$ is the estimated frequency for$u$ , the relative error of is$|f_u-\tilde{f}_u| / f_u$ .