The gsm_repro from gaglia88

Reproducibility experiments for Generalized Supervised Meta-blocking

This repository contains the code to reproduce the experiments performed in [1].

Structure

datasets contains the datasets used in the project. They will be downloaded at the first execution of run_all_tests.sh script. It is possible to edit the file datasets/datasets.json to add/remove them. For each dataset, the parameter purging_threshold (default value 1.0) affects the number of retained comparisons, decreasing it will discard more comparisons, but reducing recall. It can be lowered for the larger datasets if the available RAM is not enough (see also requirements section).
config contains the list of available feature sets and a configuration file. The configuration file lets to set the maximum memory for Spark execution, the number of parallel processes that run at a time in the different experiments, and the number of repetitions for each experiment.
docker contains the docker files to create a machine to reproduce the main experiments.
java contains the Java code, needed for the experiments.
python contains the Python code, needed for the experiments.
scala contains the Scala code, needed for the experiments.
notebooks contains the notebooks needed to generate the charts/tables of the paper.
comparison contains the code needed to reproduce the comparison with other frameworks (Table 5 of the paper).

Requirements

The original experiments were performed on a machine with four Intel Xeon E5-2697 2.40 GHz (72 cores), 216 GB of RAM, running Ubuntu 18.04. With less memory, the largest datasets could cause out-of-memory issues. In particular, 200K and 300K datasets could cause this issue, a solution to test everything is to lower the purging_threshold parameter in the file datasets/datasets.json. This will cause a more aggressive pruning, reducing the number of retained comparisons, but also recall.
Docker is needed to run the experiments.
To perform the comparison with the other tools (Table 5 in the paper) a GPU is needed. Also Docker need to be configured to use it, this require to install NVIDIA Container Toolkit.

Reproduce the experiments

1. Comparison with other frameworks

In the paper we compared the performance of Generalized Supervised Meta-blocking with Sudowoodo, ContextualBlocker, DeepBlocker and Sparkly. To perform this comparison, a different docker machine is needed, due to the heterogeneity of the different setups.

Step 1

Move in the comparison folder and run the command start_docker.sh$${\color{red}*}$$, this will create and start a new docker machine called gsm_comparison.

Step 2

Inside the docker machine, move in the comparison folder and run the script run_all_exp.sh, this will process all datasets with the four frameworks.

Step 3

Close the docker machine and proceed with the main experiments.

2. Main experiments

Step 1

Open the file config/config.ini and set, based on your system,

max_memory is the maximum memory used by Apache Spark.
parallel_process is the number of parallel processes that run in some experiments. A higher value requires more memory, so it can cause out-of-memory problems.
repetitions number of repetitions for each experiment. The default value is 10, a lower value requires less time to complete all the experiments. Do not touch the other parameters in the configuration file.

Step 2

Open the file run_all_tests.sh and set the MAX_MEMORY parameter based on your system configuration. This is the maximum memory that can be used by the Java Virtual Machine.

Step 3

Open a shell, move to the docker folder, and run the script start_docker.sh$${\color{red}*}$$. The script will run the docker-machine and login inside it.

Step 4

Inside the docker machine, run the script run_all_tests.sh. It will run all the experiments. The resulting files will be placed inside the results folder.

Step 5

When all the experiments are completed, run the script start_notebook.sh. This will start Jupyter inside the docker-machine which has a port forwarding on port 8888. Opening from the browser your machine ip:8888 will open the notebook environment. Running the notebook it is possible to reproduce all the Figures/Tables of the paper [1].

$${\color{red}*}$$ Note

If the script is run on a remote machine, it will stop running when the session ends (e.g., if the connection to the machine is lost, or the terminal closed). For this reason, it is necessary to use an intermediary tool to let the process be detached from the current session. In our experiments, we employed the screen command. You can read here how to use it. Do not use the nohup instruction because it does not let to connect to the docker machine when the script is executed.

References

[1] Gagliardelli, L., Papadakis, G., Simonini, G., Bergamaschi, S., & Palpanas, T. (2023). GSM: A generalized approach to Supervised Meta-blocking for scalable entity resolution. Information Systems, 102307.

gaglia88 / gsm_repro Goto Github PK

gsm_repro's Introduction

Reproducibility experiments for Generalized Supervised Meta-blocking

Structure

Requirements

Reproduce the experiments

1. Comparison with other frameworks

Step 1

Step 2

Step 3

2. Main experiments

Step 1

Step 2

Step 3

Step 4

Step 5

$${\color{red}*}$$ Note

References

gsm_repro's People

Contributors

Watchers

Recommend Projects

Recommend Topics

Recommend Org