PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search

Joint work between Adobe Research and Auburn University

Thang Pham, Seunghyun Yoon, Trung Bui, and Anh Nguyen.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Reproduce results for benchmark
  4. License
  5. Contact
  6. References

About The Project

Phrase in Context (PiC) is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR), and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1000 AMT crowdworkers and another set of 5 linguistic experts. The PiC benchmark is distributed under CC-BY-NC 4.0.

PiC example

🌟 Official implementation for Data Collection in our paper PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search.

🌞 Project Link: https://phrase-in-context.github.io/

🔥 Online Web Demo: https://aub.ie/phrase-search

If you use our PiC dataset or software, please consider citing:

@article{pham2022PiC,
  title={PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search},
  author={Pham, Thang M and Yoon, Seunghyun and Bui, Trung and Nguyen, Anh},
  journal={arXiv preprint arXiv:2207.09068},
  year={2022}
}

Getting Started

Prerequisites

  • Anaconda 4.10 or higher
  • Python 3.9 or higher
  • pip version 21 or higher

Installation

  1. Create a new folder and clone the repo

    mkdir phrase-in-context && cd "$_"
    git clone https://github.com/Phrase-in-Context/data-collection.git && cd data-collection
  2. Create and activate a Conda environment

    conda create -n pic_construct python=3.9
    conda activate pic_construct
  3. Install required libraries

    pip install -r requirements.txt
    bash extra_requirements.sh
  4. Download & Prepare data

    bash prepare_data.sh

Data Collection

Because no English dictionary contains sense inventories for multi-word noun phrases (mNPs), the key challenge of our data collection is to find (1) mNPs p that have multiple senses (e.g., “massive figure” can mean a large number or a huge physical shape, depending on the context); and (2) the context documents corresponding to those senses of p.

From a Wikipedia dump, we perform a 6-step procedure for mining a list of mNPs sorted in descending order of their likelihood of having multiple senses. The 19,500 most ambiguous mNPs are then passed to experts for annotation and to others for verification.

The table below summarizes our 3-stage data construction. p, s, m, d, q, and l denote target phrase, sentence, metadata, document, query, and label, respectively. This repository covers only the 6-step Data Collection method, which prepares data for annotation.

PiC construct

Please check out our paper for more details and examples.

Execute Step 1 and Step 2

  1. Download Wiki articles and remove empty articles
  2. Extract phrases (e.g., Noun, Proper Noun) along with their context sentences

    python3 extract_wiki_phrases.py
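
To make the two steps concrete, here is a minimal sketch, not the code in extract_wiki_phrases.py. It assumes articles are dicts with hypothetical "title" and "text" keys, uses a naive regex sentence splitter, and replaces real noun-phrase detection (which would need a POS tagger) with a hypothetical fixed list of candidate phrases.

```python
import re


def drop_empty_articles(articles):
    """Keep only articles whose text contains non-whitespace characters."""
    return [a for a in articles if a.get("text", "").strip()]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def phrases_with_context(articles, candidate_phrases):
    """Yield (phrase, sentence, title) for each sentence containing a candidate.

    candidate_phrases stands in for the output of a noun-phrase extractor;
    it is an assumption of this sketch, not part of the repository.
    """
    for article in drop_empty_articles(articles):
        for sentence in split_sentences(article["text"]):
            lowered = sentence.lower()
            for phrase in candidate_phrases:
                pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
                if re.search(pattern, lowered):
                    yield phrase, sentence, article["title"]


# Toy input: article "B" is empty and is dropped.
articles = [
    {"title": "A", "text": "The company reported a massive figure in sales. It grew fast."},
    {"title": "B", "text": "   "},
    {"title": "C", "text": "A massive figure loomed over the square."},
]
rows = list(phrases_with_context(articles, ["massive figure"]))
for phrase, sentence, title in rows:
    print(f"{title}: {phrase!r} in {sentence!r}")
```

Each row pairs a phrase with one context sentence and its source article, which is the kind of record the later steps group by phrase.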

Execute Step 3 to Step 6

  3. Remove phrases of a single context
  4. Find phrases of ambiguous words
  5. Find phrases in distinct contexts
    • Sort and filter by semantic dissimilarity
    • Sort by domain dissimilarity
  6. Select data for expert annotation

    python3 find_ambiguous_phrases.py
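
The idea behind removing single-context phrases and sorting by semantic dissimilarity can be sketched as follows. This is an illustration under stated assumptions, not the logic of find_ambiguous_phrases.py: context embeddings are hypothetical toy vectors, the score is the largest pairwise cosine dissimilarity between a phrase's contexts, and the ambiguous-word and domain-dissimilarity steps are omitted.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_context_dissimilarity(records):
    """Rank phrases by how far apart their contexts are.

    records: iterable of (phrase, context_embedding) pairs, where the
    embeddings are assumed to come from some sentence encoder.
    """
    contexts = defaultdict(list)
    for phrase, embedding in records:
        contexts[phrase].append(embedding)

    scored = []
    for phrase, embeddings in contexts.items():
        if len(embeddings) < 2:
            continue  # a phrase seen in a single context cannot show two senses
        # Score a phrase by its most dissimilar pair of contexts.
        score = max(1.0 - cosine_similarity(u, v)
                    for u, v in combinations(embeddings, 2))
        scored.append((phrase, score))

    # Most dissimilar (most likely ambiguous) phrases first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: "lone phrase" has one context and is filtered out.
records = [
    ("massive figure", [1.0, 0.0]),
    ("massive figure", [0.0, 1.0]),
    ("red car", [1.0, 0.0]),
    ("red car", [0.9, 0.1]),
    ("lone phrase", [0.5, 0.5]),
]
ranked = rank_by_context_dissimilarity(records)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

Taking the maximum over pairs is one possible design choice; an average over pairs would favor phrases whose contexts are consistently spread out rather than phrases with a single outlying context.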

After all 6 steps complete, the file amt_data_19500_sorted_by_semantic_domain_latest.csv is stored under the data/preparation/ folder, ready for the annotation phase.
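
A quick way to inspect the resulting file without assuming its column names is sketched below. The helper preview_csv is hypothetical (not part of the repository), and UTF-8 encoding is an assumption.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, n_rows=5):
    """Return (column names, first n_rows rows as dicts) of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:  # encoding is assumed
        reader = csv.DictReader(f)
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows


# File name and folder as given in this README; run from the repository root.
csv_path = Path("data/preparation/amt_data_19500_sorted_by_semantic_domain_latest.csv")
if csv_path.exists():
    columns, rows = preview_csv(csv_path)
    print(columns)
    for row in rows:
        print(row)
```

Reading only the first few rows keeps the check fast and shows the actual header before any downstream code relies on specific column names.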

We also benchmark the state-of-the-art models (e.g., PhraseBERT, DensePhrase, SimCSE) on our proposed tasks. Let's check it out here!

See the open issues for a full list of proposed features (and known issues).

License

Distributed under the MIT License.

Contact

The code was written and is maintained by Thang Pham (@pmthangxai, [email protected]). Contact us via email or create a GitHub issue if you have any questions or requests. Thanks!
