comhyper's Introduction

ComHyper License: MIT

Code for EMNLP'20 paper "When Hearst Is not Enough: Improving Hypernymy Detection from Corpus with Distributional Models" (arXiv)

In a nutshell, ComHyper is a complementary framework that approaches hypernymy detection from the perspective of the blind spots of Hearst pattern-based methods. As shown in the left figure, long-tailed nouns are not well covered by Hearst patterns and thus form non-negligible sparsity types. For such cases, we propose using supervised distributional models to complement pattern-based models, as shown in the right figure.
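As a rough illustration of the complementary idea, the sketch below routes a word pair to a pattern-based scorer when both words are covered by Hearst patterns and falls back to a distributional scorer otherwise. The function name `combined_score` and the two scorer callables are hypothetical and are not taken from the repository; the actual routing in ComHyper may differ.

```python
def combined_score(hypo, hyper, pattern_vocab, pattern_score, distributional_score):
    """Score a (hyponym, hypernym) pair with whichever model covers it.

    pattern_vocab: set of words observed in Hearst patterns (assumed input).
    pattern_score / distributional_score: hypothetical callables (x, y) -> float.
    """
    if hypo in pattern_vocab and hyper in pattern_vocab:
        # Both words are covered by Hearst patterns: trust the pattern-based model.
        return pattern_score(hypo, hyper)
    # At least one word is out of pattern: fall back to the distributional model.
    return distributional_score(hypo, hyper)
```

The fallback only fires for pairs the pattern-based model cannot see, so the pattern-based scores for covered pairs are left unchanged.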

Use ComHyper

1. Download Hearst pattern files and corpus.

First, prepare the extracted Hearst pattern pairs, such as hearst_counts.txt.gz from the hypernymysuite repo or data-concept.zip from Microsoft Concept Graph (also known as Probase). Set the pattern_filename parameter in the config to the location of this file.

wget https://github.com/facebookresearch/hypernymysuite/raw/master/hearst_counts.txt.gz
curl -L "https://concept.research.microsoft.com/Home/StartDownload" > data-concept.zip
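Before pointing pattern_filename at the downloaded file, it can help to check that it parses. The sketch below assumes that hearst_counts.txt.gz is a gzipped, tab-separated file with one `hyponym<TAB>hypernym<TAB>count` triple per line; that layout is an assumption and should be verified against the actual file. The helper names `load_hearst_counts` and `pattern_vocabulary` are hypothetical and are not part of the repository.

```python
import gzip
from collections import Counter


def load_hearst_counts(path):
    """Read (hyponym, hypernym) -> count from a gzipped TSV file.

    Assumes three tab-separated fields per line; other lines are skipped.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip malformed lines
            hypo, hyper, raw_count = parts
            try:
                counts[(hypo, hyper)] += float(raw_count)
            except ValueError:
                continue  # skip lines whose count is not numeric (e.g. a header)
    return counts


def pattern_vocabulary(counts):
    """Collect every word that appears on either side of a pattern pair."""
    vocab = set()
    for hypo, hyper in counts:
        vocab.add(hypo)
        vocab.add(hyper)
    return vocab
```

The resulting vocabulary is what separates words covered by Hearst patterns from those that are not.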

Then extract the contexts for words from a large-scale corpus such as Wiki + Gigaword or ukWaC. All the contexts for one word should be stored in a single txt file, with one context per line.

For words that appear in the Hearst patterns (IP words), put their context files in the directory specified by context in the config. For OOP words, put their context files in the directory specified by context_oov in the config.
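The layout described above can be produced with a small helper like the one below. It is only a sketch: the function name `write_context_files`, the `<word>.txt` file naming, and the in-memory `contexts` dictionary are assumptions for illustration, and the training scripts may expect a different naming scheme.

```python
from pathlib import Path


def write_context_files(contexts, pattern_vocab, ip_dir, oop_dir):
    """Write one txt file per word, one context per line.

    contexts: dict mapping word -> list of context strings (assumed input).
    pattern_vocab: set of words that appear in the Hearst patterns.
    ip_dir / oop_dir: directories matching `context` and `context_oov` in the config.
    """
    ip_dir, oop_dir = Path(ip_dir), Path(oop_dir)
    ip_dir.mkdir(parents=True, exist_ok=True)
    oop_dir.mkdir(parents=True, exist_ok=True)

    written = {}
    for word, sentences in contexts.items():
        target = ip_dir if word in pattern_vocab else oop_dir
        path = target / f"{word}.txt"  # hypothetical naming scheme
        # Collapse internal whitespace so each context stays on one line.
        lines = [" ".join(s.split()) for s in sentences]
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written[word] = path
    return written
```

Words whose names contain path separators would need escaping first; the sketch does not handle that case.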

2. Train and evaluate ComHyper.

To train the distributional models supervised by the output of pattern-based models, different context encoders are provided:

python train_word2score.py config/word.cfg
python train_context2score.py config/context.cfg
python train_bert2score.py config/bert.cfg
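To make "the output of pattern-based models" more concrete, the sketch below computes positive pointwise mutual information (PPMI) over Hearst pattern counts, a common pattern-based score for hypernymy. Whether the training scripts use exactly this score as the supervision target is an assumption; the function name `ppmi_scores` and the `counts` dictionary format are hypothetical.

```python
import math
from collections import defaultdict


def ppmi_scores(counts):
    """Compute PPMI for each (hyponym, hypernym) pair.

    counts: dict mapping (hyponym, hypernym) -> co-occurrence count in Hearst patterns.
    PPMI(x, y) = max(0, log(p(x, y) / (p(x) * p(y)))), where p(x) is the marginal
    of x as a hyponym and p(y) is the marginal of y as a hypernym.
    """
    total = sum(counts.values())
    hypo_totals = defaultdict(float)
    hyper_totals = defaultdict(float)
    for (hypo, hyper), c in counts.items():
        hypo_totals[hypo] += c
        hyper_totals[hyper] += c

    scores = {}
    for (hypo, hyper), c in counts.items():
        if c <= 0:
            continue  # log is undefined for non-positive counts
        pmi = math.log(c * total / (hypo_totals[hypo] * hyper_totals[hyper]))
        scores[(hypo, hyper)] = max(0.0, pmi)
    return scores
```

Pairs never seen in a pattern get no score at all here, which is the sparsity that the distributional encoders are meant to cover.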

The same evaluation scripts work for all settings. For reproducing the results, run:

python evaluation/evaluation_all_context.py ../config/context.cfg 

Note that we chose not to report the BERT encoder results in our original paper for efficiency reasons, but we release the relevant code for incorporating effective pre-trained contextualized encoders to further improve performance. Feel free to open a PR or contact cyuaq # cse.ust.hk!

Citation

Please cite the following paper if you find our method helpful. Thanks!

@inproceedings{yu-etal-2020-hearst,
    title = "When Hearst Is Not Enough: Improving Hypernymy Detection from Corpus with Distributional Models",
    author = "Yu, Changlong and Han, Jialong and Wang, Peifeng and Song, Yangqiu and Zhang, Hongming and Ng, Wilfred and Shi, Shuming",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = "nov",
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.502",
    pages = "6208--6217",
}
