DIAL

Implementation of Deep Indexed Active Learning for Matching Heterogeneous Entity Representations (arXiv:2104.03986).

Traditional Active Learning (AL) methods for pairwise classification tasks follow the pipeline described here. In each iteration, the learning algorithm (learner) learns a matcher (drawn as an ellipse, which we use to denote model components) from the labeled data T, the labeled pairs collected from the (human) labeler so far, while the example selector (selector) chooses the most informative unlabeled pairs to acquire labels for. After the new labels are added to T, the process repeats until the matcher reaches sufficient quality.
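The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in this repository: the Jaccard feature, the threshold-based learner, the uncertainty-based selector, and the `oracle` callback standing in for the human labeler are all assumptions chosen to keep the example self-contained.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity, used here as a toy pair feature."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def learn_threshold(labeled: Dict[Pair, bool]) -> float:
    """Toy learner: midpoint between the mean match and mean non-match score."""
    pos = [jaccard(*p) for p, y in labeled.items() if y]
    neg = [jaccard(*p) for p, y in labeled.items() if not y]
    if not pos or not neg:
        return 0.5  # fall back until both classes have been seen
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def select(unlabeled: List[Pair], threshold: float, k: int) -> List[Pair]:
    """Toy selector: the k pairs whose score is closest to the decision boundary."""
    return sorted(unlabeled, key=lambda p: abs(jaccard(*p) - threshold))[:k]


def active_learning(
    pairs: List[Pair],
    oracle: Callable[[Pair], bool],  # hypothetical stand-in for the labeler
    rounds: int = 3,
    k: int = 2,
) -> Tuple[float, Dict[Pair, bool]]:
    labeled: Dict[Pair, bool] = {}  # plays the role of T
    threshold = learn_threshold(labeled)
    for _ in range(rounds):
        unlabeled = [p for p in pairs if p not in labeled]
        if not unlabeled:
            break
        for p in select(unlabeled, threshold, k):
            labeled[p] = oracle(p)  # acquire labels for the selected pairs
        threshold = learn_threshold(labeled)  # retrain on the enlarged T
    return threshold, labeled
```

In this sketch the candidate list `pairs` is fixed before the loop starts, which mirrors the traditional setup where blocking happens once, outside the feedback loop.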

Our approach integrates the matcher and the blocker and introduces a new AL workflow. Compared to the traditional pipeline, the two most notable differences are:

  1. the blocker (dashed box) is now part of the AL feedback loop, and
  2. the matcher is a component within the blocker. As the base matcher, we use transformer-based pretrained language models (TPLMs), which have recently led to excellent entity resolution (ER) accuracy in the passive (non-AL) setting.
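One way to picture a blocker that sits inside the feedback loop is as a nearest-neighbour search over record embeddings produced by the current matcher, so that the candidate pairs can change whenever the matcher is updated. The sketch below is an illustration under stated assumptions rather than this repository's code: `embed` is a hypothetical stand-in (normalised letter counts) for a TPLM encoder, and brute-force cosine similarity replaces a proper index such as one built with `faiss`.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def embed(text: str) -> Vector:
    """Hypothetical encoder: L2-normalised letter counts (stand-in for a TPLM)."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def block(
    left: Sequence[str],
    right: Sequence[str],
    encoder: Callable[[str], Vector] = embed,
    k: int = 1,
) -> List[Tuple[int, int, float]]:
    """Return (left_index, right_index, similarity) for each left record's top-k neighbours."""
    right_vecs = [encoder(r) for r in right]
    candidates: List[Tuple[int, int, float]] = []
    for i, record in enumerate(left):
        query = encoder(record)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(query, vec)), j)
            for j, vec in enumerate(right_vecs)
        ]
        for sim, j in sorted(scored, reverse=True)[:k]:
            candidates.append((i, j, sim))
    return candidates
```

Passing the encoder of the most recently trained matcher as `encoder` in every AL round is what would place the blocker inside the loop in this sketch; with a fixed encoder it degenerates to ordinary one-off blocking.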

Getting Started

Environment

This code has been tested on a machine with 64 Intel Xeon Silver 4216 CPUs at 2.10 GHz, 1007 GB of RAM, and a single NVIDIA Titan Xp GPU (12 GB) with CUDA 10.2, running Ubuntu 18.04.

Reproducing the Experiments

The first step is to get the data. We provide the data used in the DeepMatcher experiments (Link1 Link2 Link3). The multilingual data can be downloaded from salesforce/localization-xml-mt:

cd MultiLingual
git clone https://github.com/salesforce/localization-xml-mt.git

Now create a virtual environment using conda:

conda create -n DIAL_env
conda activate DIAL_env
conda install -y -c conda-forge -c pytorch pytorch==1.6 cudatoolkit=10.2
pip install faiss-cpu transformers scikit-learn pandas 
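After installing, it can be useful to confirm that the packages are importable from the active environment. The following is a small optional sketch, not part of this repository; `check_modules` is a hypothetical helper, and the names in `REQUIRED` are the import names of the packages installed above (`faiss-cpu` is imported as `faiss`, `scikit-learn` as `sklearn`, and `pytorch` as `torch`).

```python
import importlib.util
from typing import Dict, Iterable

# Top-level import names for the packages installed with conda and pip.
REQUIRED = ("torch", "faiss", "transformers", "sklearn", "pandas")


def check_modules(names: Iterable[str] = REQUIRED) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name:<14}{'ok' if found else 'MISSING'}")
```

This only checks that each module can be located; it does not verify versions or that CUDA is usable.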

Use run_single.sh to run DIAL. For example:

bash run_single.sh DIAL amazon_google_exp 

To evaluate on Test, run

bash run_eval.sh Eval-Test DIAL amazon_google_exp 

and to evaluate on All Pairs, run

bash run_eval.sh Eval-AllPairs DIAL amazon_google_exp 
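The three commands above can also be chained from Python when running them back to back. This is a convenience sketch only: `dial_commands` and `run_all` are hypothetical helpers that are not part of this repository, they use nothing beyond the script names and arguments shown above, and they assume the working directory is the repository root.

```python
import shlex
import subprocess
from typing import List


def dial_commands(experiment: str, method: str = "DIAL") -> List[List[str]]:
    """Build the training and evaluation commands for one experiment name."""
    return [
        ["bash", "run_single.sh", method, experiment],
        ["bash", "run_eval.sh", "Eval-Test", method, experiment],
        ["bash", "run_eval.sh", "Eval-AllPairs", method, experiment],
    ]


def run_all(experiment: str, dry_run: bool = True) -> List[str]:
    """Return the commands as strings; execute them in order unless dry_run is set."""
    rendered = []
    for cmd in dial_commands(experiment):
        rendered.append(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return rendered


if __name__ == "__main__":
    # Dry run by default: only print what would be executed.
    for line in run_all("amazon_google_exp"):
        print(line)
```

Calling `run_all("amazon_google_exp", dry_run=False)` would execute the steps in sequence and raise `subprocess.CalledProcessError` if any of them exits with a non-zero status.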

Currently supported datasets: Walmart-Amazon, Amazon-Google, DBLP-ACM, DBLP-Google Scholar, and Abt-Buy.

To run experiments with the multilingual dataset:

cd MultiLingual
bash run_multilingual_expts.sh DIAL-Multilingual

Citation

If you use this code for your research, please consider citing our arXiv preprint:

@misc{jain2021deep,
      title={Deep Indexed Active Learning for Matching Heterogeneous Entity Representations}, 
      author={Arjit Jain and Sunita Sarawagi and Prithviraj Sen},
      year={2021},
      eprint={2104.03986},
      archivePrefix={arXiv},
      primaryClass={cs.DB}
}

References:

  1. https://github.com/megagonlabs/ditto
  2. https://github.com/brunnurs/entity-matching-transformer
  3. https://github.com/anhaidgroup/deepmatcher
  4. https://github.com/JordanAsh/badge
