Joint work between Adobe Research and Auburn University
Thang Pham, Seunghyun Yoon, Trung Bui, and Anh Nguyen.
Phrase in Context (PiC) is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR), and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1,000 AMT crowdworkers and another set of 5 linguistic experts. The PiC benchmark is distributed under CC BY-NC 4.0.
🌟 Official implementation for Data Collection in our paper PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search.
🌞 Project Link: https://phrase-in-context.github.io/
🔥 Online Web Demo: https://aub.ie/phrase-search
If you use our PiC dataset or software, please consider citing:
```bibtex
@article{pham2022PiC,
  title={PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search},
  author={Pham, Thang M and Yoon, Seunghyun and Bui, Trung and Nguyen, Anh},
  journal={arXiv preprint arXiv:2207.09068},
  year={2022}
}
```
- Anaconda 4.10 or higher
- Python 3.9 or higher
- pip version 21 or higher
- Create a new folder and clone the repo

  ```shell
  mkdir phrase-in-context && cd "$_"
  git clone https://github.com/Phrase-in-Context/data-collection.git && cd data-collection
  ```

- Create and activate a Conda environment

  ```shell
  conda create -n pic_construct python=3.9
  conda activate pic_construct
  ```

- Install the required libraries

  ```shell
  pip install -r requirements.txt
  bash extra_requirements.sh
  ```

- Download and prepare the data

  ```shell
  bash prepare_data.sh
  ```
As no English dictionary contains sense inventories for multi-word noun phrases (mNPs), the key challenge of our data collection is to find (1) mNPs p that have multiple senses (e.g., "massive figure" can mean a large number or a huge physical shape, depending on the context); and (2) context documents corresponding to those senses of p.
From a Wikipedia dump, we perform a 6-step procedure for mining a list of mNPs sorted in descending order by their likelihood of containing multiple senses. The 19,500 most ambiguous mNPs are then passed to experts for annotation and to other workers for verification.
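The core ranking idea can be illustrated with a toy sketch: score each candidate phrase by how dissimilar its context sentences are to one another, on the intuition that divergent contexts signal multiple senses. The `ambiguity_score` helper and the bag-of-words cosine below are simplified stand-ins for illustration only, not the repository's actual implementation, which relies on stronger semantic and domain signals.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def ambiguity_score(contexts):
    # One minus the mean pairwise similarity of a phrase's context
    # sentences: higher means more dissimilar contexts, i.e. a better
    # candidate for carrying multiple senses.
    bags = [Counter(c.lower().split()) for c in contexts]
    pairs = list(combinations(bags, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(cosine(a, b) for a, b in pairs) / len(pairs)

contexts = {
    "massive figure": [
        "the statue is a massive figure carved from granite",
        "the deficit reached a massive figure of nine billion dollars",
    ],
    "granite statue": [
        "the granite statue stands in the park",
        "a granite statue was unveiled in the park",
    ],
}
# Phrases whose contexts diverge most come first.
ranked = sorted(contexts, key=lambda p: ambiguity_score(contexts[p]), reverse=True)
print(ranked)  # → ['massive figure', 'granite statue']
```

In the real pipeline, a lexical cosine like this would be far too weak; it is only meant to show the shape of the ranking step.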
The table below summarizes our 3-stage data construction, where p, s, m, d, q, and l denote the target phrase, sentence, metadata, document, query, and label, respectively. This repository focuses only on the 6-step Data Collection method that prepares data for annotation.
Please check out our paper for more details and examples.
- Download Wiki articles and remove empty articles
- Extract phrases (e.g., nouns, proper nouns) along with their context sentences

  ```shell
  python3 extract_wiki_phrases.py
  ```

- Remove phrases with only a single context
- Find phrases containing ambiguous words
- Find phrases appearing in distinct contexts
- Sort and filter by semantic dissimilarity
- Sort by domain dissimilarity
- Select data for expert annotation

  ```shell
  python3 find_ambiguous_phrases.py
  ```
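The single-context filter in the steps above is conceptually simple: a phrase seen in only one context cannot exhibit multiple senses, so it is dropped before ranking. A minimal sketch, using a hypothetical `filter_single_context` helper (not a function from this repository):

```python
def filter_single_context(phrase_to_contexts, min_contexts=2):
    # Keep only phrases observed in at least `min_contexts` distinct
    # context sentences; a phrase with a single context cannot be
    # annotated with multiple senses.
    return {
        phrase: ctxs
        for phrase, ctxs in phrase_to_contexts.items()
        if len(set(ctxs)) >= min_contexts
    }

corpus = {
    "hot dog": ["he ate a hot dog", "the hot dog panted in the sun"],
    "blue car": ["a blue car drove past"],
}
kept = filter_single_context(corpus)
print(sorted(kept))  # → ['hot dog']
```

Deduplicating with `set` also discards phrases whose contexts are verbatim copies of each other, which carry no extra sense information.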
Upon completion of the 6 steps, the resulting file `amt_data_19500_sorted_by_semantic_domain_latest.csv` is stored under the `data/preparation/` folder, ready for the annotation phase.
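Before launching annotation, it can be worth sanity-checking the prepared CSV. The `peek` helper below is illustrative and not part of this repository; the column layout is whatever the pipeline wrote, so inspect the header rather than assuming field names:

```python
import csv
import os

# Path produced by the pipeline above; adjust if your output folder differs.
path = "data/preparation/amt_data_19500_sorted_by_semantic_domain_latest.csv"

def peek(csv_path, n=3):
    # Print the header and the first n rows so the file can be
    # sanity-checked before uploading it to the annotation platform.
    if not os.path.exists(csv_path):
        print(f"{csv_path} not found - run prepare_data.sh and the "
              "extraction scripts first")
        return []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        rows = [row for _, row in zip(range(n + 1), reader)]
    for row in rows:
        print(row)
    return rows

peek(path)
```

A quick `wc -l` on the file should also report roughly 19,500 data rows if all steps completed.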
We also benchmark state-of-the-art models (e.g., PhraseBERT, DensePhrases, SimCSE) on our proposed tasks. Check them out here!
See the open issues for a full list of proposed features (and known issues).
The source code is distributed under the MIT License.
The entire codebase was developed and maintained by Thang Pham, @pmthangxai - [email protected]. Contact us via email or create GitHub issues if you have any questions or requests. Thanks!