This is an exercise for the course “Decentralized Technologies”, part of the master’s degree program "Data and Web Science" at the Aristotle University of Thessaloniki.
Given a potentially large set of d-dimensional points, where each point is represented as a d-dimensional vector, the goal is to detect interesting points. The project is based on the concept of dominance: a point p dominates another point q when p is at least as good as q in every dimension and strictly better in at least one dimension. We assume that smaller values are preferable. For example, the point p(1, 2) dominates q(3, 4), since 1 < 3 and 2 < 4. Likewise, p(1, 2) dominates q(1, 3): the two points have the same x coordinate, but the y coordinate of p is smaller than that of q. There are three tasks to complete:
- Task1. Given a set of d-dimensional points, return the set of points that are not dominated. This is also known as the skyline set.
- Task2. Given a set of d-dimensional points, return the k points with the highest dominance score. The dominance score of a point p is defined as the total number of points dominated by p.
- Task3. Given a set of d-dimensional points, return the k points from the skyline with the highest dominance score.
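The dominance definition and the three tasks above can be written down as a minimal brute-force Python sketch. The function names below are hypothetical and chosen for illustration only; this is not the project's Spark implementation, just a direct transcription of the definitions that compares every pair of points.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Return True if p dominates q (smaller values are preferable)."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Task 1: the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def dominance_score(p: Point, points: List[Point]) -> int:
    """Total number of points in the set that p dominates."""
    return sum(1 for q in points if dominates(p, q))


def top_k_dominating(points: List[Point], k: int) -> List[Tuple[Point, int]]:
    """Task 2: the k points with the highest dominance score."""
    scored = [(p, dominance_score(p, points)) for p in points]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def top_k_skyline(points: List[Point], k: int) -> List[Tuple[Point, int]]:
    """Task 3: the k skyline points with the highest dominance score."""
    scored = [(p, dominance_score(p, points)) for p in skyline(points)]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    pts = [(1, 2), (3, 4), (1, 3), (2, 1), (4, 4)]
    print(skyline(pts))
    print(top_k_dominating(pts, 1))
    print(top_k_skyline(pts, 2))
```

Note that the score in Task 3 is still computed against the whole dataset; only the candidates are restricted to the skyline.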
The datasets are generated by the Python scripts that can be found here
The program reads the file settings.json, which must be located at the root of the project. This JSON file contains all the required parameters and has the following format:
{
"description":"Distribution: Corelated, Points: 1000, Dimensions: 2, Generated with entropy: 0.5",
"cores":4,
"testName":"anti-corelated.s1000.e0.5.d2",
"dataFile":"datasets/&NAME&.csv",
"task1ResultsOutput":"results/&NAME&/task1.csv",
"task2ResultsOutput":"results/&NAME&/task2.csv",
"task3ResultsOutput":"results/&NAME&/task3.csv",
"topKpoints":10,
"executeTask2":true,
"executeTask3":true
}
The placeholder &NAME& in the paths is replaced with the value of the testName property. The properties "cores", "testName" and "topKpoints" can also be provided as arguments to the DominanceQueries jar. To execute it: java -jar DominanceQueries.jar settings_json_path test_case_index_of_json_file test_name top_k_points cpu_cores
All arguments are optional. By default, the program loads the settings.json file from the execution path, uses the first test case (index 0) and takes the remaining values from that test case's properties.
A simple Python script, visualize.py, plots two datasets together in the same scatter plot using different colors, so that the results of the different tasks can be inspected visually.
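As an illustration of the &NAME& substitution described above, here is a small Python sketch. The helper `resolve_paths` is hypothetical and is not taken from the DominanceQueries source; it only assumes that every string property may contain the placeholder and that it is replaced by the value of testName.

```python
import json

PLACEHOLDER = "&NAME&"


def resolve_paths(test_case: dict) -> dict:
    """Replace &NAME& in every string property with the testName value."""
    name = test_case["testName"]
    return {
        key: value.replace(PLACEHOLDER, name) if isinstance(value, str) else value
        for key, value in test_case.items()
    }


# A shortened test case using the property names from settings.json.
settings_text = """
{
  "cores": 4,
  "testName": "anti-corelated.s1000.e0.5.d2",
  "dataFile": "datasets/&NAME&.csv",
  "task1ResultsOutput": "results/&NAME&/task1.csv",
  "topKpoints": 10
}
"""

resolved = resolve_paths(json.loads(settings_text))
print(resolved["dataFile"])            # datasets/anti-corelated.s1000.e0.5.d2.csv
print(resolved["task1ResultsOutput"])  # results/anti-corelated.s1000.e0.5.d2/task1.csv
```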
Usage:
Arguments:
-h, --help
show this help message and exit
-d DATA, --data DATA
Data to plot
-l HIGHLIGHT, --highlight HIGHLIGHT
Data to highlight
-s SAMPLES, --samples SAMPLES
Samples to visualise. Set 0 to use all of them.
-o OUTPUT, --output OUTPUT
Where to save the plot. If not provided, the plot is not saved.
The Python script bruteforce.py computes the results of the required tasks using a brute-force method, in order to validate the results of the Spark implementation. Avoid running this script on large datasets: a brute-force approach compares every pair of points, so its running time grows quadratically with the number of points.
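For reference, the options listed above could be declared with `argparse` roughly as follows. This is a reconstruction from the help text only, not the actual source of visualize.py: the option types, the default of 0 for `--samples` and marking `--data` as required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line options documented for visualize.py."""
    parser = argparse.ArgumentParser(
        description="Plot two datasets in the same scatter plot."
    )
    parser.add_argument("-d", "--data", required=True,  # assumed required
                        help="Data to plot")
    parser.add_argument("-l", "--highlight",
                        help="Data to highlight")
    parser.add_argument("-s", "--samples", type=int, default=0,  # assumed default
                        help="Samples to visualise. Set 0 to use all of them.")
    parser.add_argument("-o", "--output",
                        help="Define where to save the plot, "
                             "if not provided it is not saved")
    return parser


# Example invocation with hypothetical file names.
args = build_parser().parse_args(["-d", "data.csv", "-l", "task1.csv", "-s", "0"])
print(args.data, args.highlight, args.samples, args.output)
```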
Usage:
Arguments:
-h, --help
show this help message and exit
-d DATA, --data DATA
Input data to process
-t TOP, --top TOP
Number of top points to return, ranked by dominance score