lefteris-souflas / entity-resolution

This project addresses Entity Resolution challenges: schema-agnostic blocking, pairwise comparisons, Meta-Blocking graph construction, and Jaccard similarity computation. Deliverables include Python source code, reports, and reproducibility guidelines.

License: MIT License

Languages: Jupyter Notebook 76.26%, Python 23.74%
Topics: entity-resolution, meta-blocking, token-blocking, graph, jaccard-similarity, edge-pruning

Entity Resolution

Assignment 2 for the Advanced Data Engineering Course of AUEB's MSc in Business Analytics

Introduction

This assignment tackles various challenges in Entity Resolution using the provided ER-Data.csv file. Tasks A to D collectively aim to enhance data quality and accuracy through schema-agnostic methods, pairwise comparisons, Meta-Blocking graph construction, and similarity computation.

Task A [30 points]

  • Implement Token Blocking method as a schema-agnostic approach.
  • Generate blocks in the form of K-V (Key-value) pairs.
  • Use all attributes (except id) for creating blocks.
  • Ensure accurate matching by transforming strings to lowercase during token creation and filtering out stop-words.
  • Pretty-print the index for clear readability.
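The steps above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the stop-word list, the sample records, and every attribute name other than id and title are hypothetical, and dropping blocks that hold a single entity is an assumption (such blocks produce no comparisons).

```python
import re
from collections import defaultdict
from pprint import pprint

# Hypothetical stop-word list; the assignment does not prescribe one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "on", "to", "with"}


def tokenize(value):
    """Lowercase a value, split it on non-alphanumeric characters, drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", str(value).lower())
    return {token for token in tokens if token not in STOP_WORDS}


def token_blocking(records, id_field="id"):
    """Build a token -> entity ids index over every attribute except the id."""
    blocks = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            if attribute == id_field or value is None:
                continue
            for token in tokenize(value):
                blocks[token].add(record[id_field])
    # Assumption: keep only blocks that can yield at least one comparison.
    return {token: sorted(ids) for token, ids in blocks.items() if len(ids) > 1}


# Hypothetical sample records standing in for rows of ER-Data.csv.
records = [
    {"id": 1, "title": "The Entity Resolution Survey", "authors": "A. Smith", "venue": "VLDB"},
    {"id": 2, "title": "Entity resolution: a survey", "authors": "Alice Smith", "venue": "vldb"},
    {"id": 3, "title": "Graph Pruning Methods", "authors": "B. Jones", "venue": "SIGMOD"},
]

blocks = token_blocking(records)
pprint(blocks)  # pprint sorts the keys, which keeps the index readable
```

Because tokens are drawn from all attributes without regard to which column they came from, the index stays schema-agnostic: a token such as "smith" lands in the same block whether it appeared in a title or an author field.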

Task B [25 points]

  • Compute all possible comparisons to resolve duplicates within the created blocks.
  • Print the final calculated number of comparisons.
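One way to read this task: a block holding n entities requires n(n-1)/2 pairwise comparisons, and the total is the sum over all blocks. The sketch below assumes the block index is a dictionary from token to a list of entity ids; the sample blocks are hypothetical. Note that this count includes redundant comparisons, since the same pair may co-occur in several blocks.

```python
from math import comb


def count_comparisons(blocks):
    """Sum the pairwise comparisons of every block: C(n, 2) for a block of size n."""
    return sum(comb(len(ids), 2) for ids in blocks.values())


# Hypothetical block index (token -> entity ids).
blocks = {"entity": [1, 2], "survey": [1, 2, 3], "vldb": [2, 3, 4, 5]}

total = count_comparisons(blocks)
print(f"Total comparisons: {total}")  # 1 + 3 + 6 = 10
```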

Task C [30 points]

  • Create a Meta-Blocking graph of the block collection from Task A.
  • Utilize the CBS Weighting Scheme to refine the graph.
  • Prune edges with weight < 2 to reduce unnecessary comparisons.
  • Re-calculate the final number of comparisons after edge pruning.
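A minimal sketch of these steps, assuming CBS denotes the Common Blocks Scheme, in which the weight of an edge is the number of blocks the two entities share. Nodes are entity ids, an edge connects every pair that co-occurs in at least one block, and each surviving edge after pruning counts as one comparison. The function names and sample blocks are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cbs_graph(blocks):
    """Return edge weights: (id_a, id_b) -> number of blocks the pair shares."""
    edge_weights = Counter()
    for ids in blocks.values():
        # Sorting gives every pair a canonical (smaller, larger) orientation.
        for pair in combinations(sorted(set(ids)), 2):
            edge_weights[pair] += 1
    return edge_weights


def prune_edges(edge_weights, min_weight=2):
    """Drop edges whose weight is below min_weight."""
    return {pair: w for pair, w in edge_weights.items() if w >= min_weight}


# Hypothetical block index (token -> entity ids).
blocks = {"entity": [1, 2], "survey": [1, 2, 3], "vldb": [2, 3, 4]}

graph = build_cbs_graph(blocks)
pruned = prune_edges(graph)

print(f"Edges before pruning: {len(graph)}")        # 5 distinct pairs
print(f"Comparisons after pruning: {len(pruned)}")  # 2 pairs share two blocks
```

Building the graph already removes redundant comparisons, because each pair appears as a single edge however many blocks it shares. Pruning then discards pairs that share only one block.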

Task D [15 points]

  • Develop a function to compute Jaccard similarity based on the 'title' attribute.
  • The function takes two entities as input and computes their similarity.
  • No actual comparisons using this function are required.
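Such a function might look like the sketch below. It assumes each entity is a dictionary with a 'title' key and that similarity is computed over lowercase word tokens; the assignment does not state the token granularity, so word-level tokens are an assumption, as is returning 0.0 when both titles are empty.

```python
import re


def title_tokens(entity):
    """Return the set of lowercase word tokens in the entity's 'title'."""
    title = entity.get("title") or ""
    return set(re.findall(r"[a-z0-9]+", str(title).lower()))


def jaccard_similarity(entity_a, entity_b):
    """Jaccard similarity of two entities' titles: |A ∩ B| / |A ∪ B|."""
    tokens_a = title_tokens(entity_a)
    tokens_b = title_tokens(entity_b)
    union = tokens_a | tokens_b
    if not union:
        # Assumption: two empty titles are treated as having no similarity.
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


# Hypothetical entities for illustration.
a = {"id": 1, "title": "Entity Resolution Survey"}
b = {"id": 2, "title": "Entity Resolution Methods"}
print(jaccard_similarity(a, b))  # 2 shared tokens out of 4 distinct -> 0.5
```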

Deliverables:

  1. Source code with useful comments.
  2. A small report for each task justifying the code and describing the methodology.
  3. For Task C ONLY, a partially solved answer with proper justification will also be accepted.
  4. Programming language: Python.

Code Reproducibility

The results presented in this report are intended to be reproducible. The following steps explain how to access, set up, and execute the code.

Accessing the Code

The complete code for Tasks A, B, C, and D is available in the Code and Jupyter folders under the repository root. Download the code files from there.

Environment Setup

Some tasks and functions depend on external libraries. Install every library that the code imports before running it.

Executing the Code

The ipynb file can be executed in Jupyter Notebook, and the py file in any Python environment. Open the respective code file for each task and follow the instructions in the comments.

Task-Specific Instructions

For each task, refer to the corresponding section of the Jupyter Notebook, or to the exported PDF file if you cannot run the ipynb file, for an in-depth explanation of the code and the methodology. This report presents only a summary justification of both. The code is modular and organized, so the results are straightforward to follow and reproduce.

Note: Ensure that the ER-Data.csv file is placed in the Data directory before running the code.

By following these steps, readers can confidently reproduce the presented results and gain a deeper understanding of the methodologies applied in this study.
