Joint work between Adobe Research and Auburn University
Thang Pham, Seunghyun Yoon, Trung Bui, and Anh Nguyen.
Phrase in Context (PiC) is a curated benchmark for phrase understanding and semantic search, consisting of three tasks of increasing difficulty: Phrase Similarity (PS), Phrase Retrieval (PR), and Phrase Sense Disambiguation (PSD). The datasets are annotated by 13 linguistic experts on Upwork and verified by two groups: ~1,000 AMT crowdworkers and another set of 5 linguistic experts. The PiC benchmark is distributed under CC BY-NC 4.0.
🌟 Official implementation for Data Collection in our paper PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search.
🌞 Project Link: https://phrase-in-context.github.io/
🔥 Online Web Demo: https://aub.ie/phrase-search
If you use our PiC dataset or software, please consider citing:
```bibtex
@article{pham2022PiC,
  title={PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search},
  author={Pham, Thang M and Yoon, Seunghyun and Bui, Trung and Nguyen, Anh},
  journal={arXiv preprint arXiv:2207.09068},
  year={2022}
}
```
- Anaconda 4.10 or higher
- Python 3.9 or higher
- pip version 21 or higher
- Create a new folder and clone the repo

  ```shell
  mkdir phrase-in-context && cd "$_"
  git clone https://github.com/Phrase-in-Context/data-collection.git && cd data-collection
  ```

- Create and activate a Conda environment

  ```shell
  conda create -n pic_construct python=3.9
  conda activate pic_construct
  ```

- Install the required libraries

  ```shell
  pip install -r requirements.txt
  bash extra_requirements.sh
  ```

- Download and prepare the data

  ```shell
  bash prepare_data.sh
  ```
As no English dictionary contains sense inventories for multi-word noun phrases (mNPs), the key challenge of our data collection is to find (1) mNPs p that have multiple senses (e.g., "massive figure" can mean a large number or a huge physical shape, depending on the context); and (2) context documents corresponding to those senses of p.
From a Wikipedia dump, we perform a 6-step procedure for mining a list of mNPs sorted in descending order by their likelihood of containing multiple senses. The 19,500 most ambiguous mNPs are then passed to experts for annotation and to other workers for verification.
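The core ranking idea can be illustrated with a toy sketch: score each candidate phrase by how dissimilar its context sentences are to one another, on the intuition that divergent contexts signal multiple senses. The `ambiguity_score` helper and the bag-of-words cosine below are simplified stand-ins for illustration only, not the repository's actual implementation, which relies on stronger semantic and domain signals.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def ambiguity_score(contexts):
    # One minus the mean pairwise similarity of a phrase's context
    # sentences: higher means more dissimilar contexts, i.e. a better
    # candidate for carrying multiple senses.
    bags = [Counter(c.lower().split()) for c in contexts]
    pairs = list(combinations(bags, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(cosine(a, b) for a, b in pairs) / len(pairs)

contexts = {
    "massive figure": [
        "the statue is a massive figure carved from granite",
        "the deficit reached a massive figure of nine billion dollars",
    ],
    "granite statue": [
        "the granite statue stands in the park",
        "a granite statue was unveiled in the park",
    ],
}
# Phrases whose contexts diverge most come first.
ranked = sorted(contexts, key=lambda p: ambiguity_score(contexts[p]), reverse=True)
print(ranked)  # → ['massive figure', 'granite statue']
```

In the real pipeline, a lexical cosine like this would be far too weak; it is only meant to show the shape of the ranking step.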
The table below summarizes our 3-stage data construction, where p, s, m, d, q, and l denote the target phrase, sentence, metadata, document, query, and label, respectively. This repository focuses only on the 6-step Data Collection method that prepares data for annotation.
Please check out our paper for more details and examples.
- Download Wiki articles and remove empty articles
- Extract phrases (e.g., nouns, proper nouns) along with their context sentences

  ```shell
  python3 extract_wiki_phrases.py
  ```

- Remove phrases with only a single context
- Find phrases containing ambiguous words
- Find phrases appearing in distinct contexts
- Sort and filter by semantic dissimilarity
- Sort by domain dissimilarity
- Select data for expert annotation

  ```shell
  python3 find_ambiguous_phrases.py
  ```
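The single-context filter in the steps above is conceptually simple: a phrase seen in only one context cannot exhibit multiple senses, so it is dropped before ranking. A minimal sketch, using a hypothetical `filter_single_context` helper (not a function from this repository):

```python
def filter_single_context(phrase_to_contexts, min_contexts=2):
    # Keep only phrases observed in at least `min_contexts` distinct
    # context sentences; a phrase with a single context cannot be
    # annotated with multiple senses.
    return {
        phrase: ctxs
        for phrase, ctxs in phrase_to_contexts.items()
        if len(set(ctxs)) >= min_contexts
    }

corpus = {
    "hot dog": ["he ate a hot dog", "the hot dog panted in the sun"],
    "blue car": ["a blue car drove past"],
}
kept = filter_single_context(corpus)
print(sorted(kept))  # → ['hot dog']
```

Deduplicating with `set` also discards phrases whose contexts are verbatim copies of each other, which carry no extra sense information.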
Upon completion of the 6 steps, the resulting file `amt_data_19500_sorted_by_semantic_domain_latest.csv` is stored under the `data/preparation/` folder, ready for the annotation phase.
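Before launching annotation, it can be worth sanity-checking the prepared CSV. The `peek` helper below is illustrative and not part of this repository; the column layout is whatever the pipeline wrote, so inspect the header rather than assuming field names:

```python
import csv
import os

# Path produced by the pipeline above; adjust if your output folder differs.
path = "data/preparation/amt_data_19500_sorted_by_semantic_domain_latest.csv"

def peek(csv_path, n=3):
    # Print the header and the first n rows so the file can be
    # sanity-checked before uploading it to the annotation platform.
    if not os.path.exists(csv_path):
        print(f"{csv_path} not found - run prepare_data.sh and the "
              "extraction scripts first")
        return []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        rows = [row for _, row in zip(range(n + 1), reader)]
    for row in rows:
        print(row)
    return rows

peek(path)
```

A quick `wc -l` on the file should also report roughly 19,500 data rows if all steps completed.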
We also benchmark state-of-the-art models (e.g., PhraseBERT, DensePhrases, SimCSE) on our proposed tasks. Check them out here!
See the open issues for a full list of proposed features (and known issues).
The source code is distributed under the MIT License.
The entire codebase was developed and maintained by Thang Pham, @pmthangxai - [email protected]. Contact us via email or create GitHub issues if you have any questions or requests. Thanks!