
Entity Synonym Discovery via Multipiece Bilateral Context Matching

This project provides source code and data for SynonymNet, a model that detects entity synonyms via multipiece bilateral context matching.

Details about SynonymNet can be found in the paper referenced below, and the implementation is based on the TensorFlow library.


Installation

A GPU is recommended to accelerate training.

Tensorflow

The code was developed on TensorFlow 1.5 and also runs on TensorFlow 1.15.0, the version pinned in requirements.txt. Installation instructions are available on the TensorFlow website.

Dependencies

The code is written in Python 3.7. Its dependencies are summarized in the file requirements.txt.

tensorflow_gpu==1.15.0
numpy==1.14.0
pandas==0.25.1
gensim==3.8.1
scikit_learn==0.21.2

You can install these dependencies like this:

pip3 install -r requirements.txt

Usage

  • Run the model on the Wikipedia+Freebase dataset with the siamese architecture and the default hyperparameter settings:
    cd src
    python3 train_siamese.py --dataset=wiki

  • For all available hyperparameter settings, use
    python3 train_siamese.py -h

  • Run the model on the Wikipedia+Freebase dataset with the triplet architecture and the default hyperparameter settings:
    cd src
    python3 train_triplet.py --dataset=wiki

Data

Format

Each dataset is a folder under ./input_data, where each sub-folder indicates a train/valid/test split:

./input_data
└── wiki
    ├── train
    |   ├── siamese_contexts.txt
    |   └── triple_contexts.txt
    ├── valid
    |   ├── siamese_contexts.txt
    |   └── triple_contexts.txt
    ├── test
    |   ├── knn_siamese_contexts.txt
    |   ├── knn_triple_contexts.txt
    |   ├── siamese_contexts.txt
    |   └── triple_contexts.txt
    ├── skipgram-vec200-mincount5-win5.bin
    ├── fasttext-vec200-mincount5-win5.bin
    └── in_vocab (built during training)

In each sub-folder,

  • The siamese_contexts.txt file contains entities and contexts for the siamese architecture. Each line has five columns, separated by \t: entity_a \t entity_b \t context_a1@@context_a2@@...@@context_an \t context_b1@@context_b2@@...@@context_bn \t label (see the parsing sketch after this list).

    • entity_a and entity_b are the two entities, e.g. u.s._government||m.01bqks|| and united_states||m.01bqks||.
    • The next two columns give the contexts of the two entities: context_a1@@context_a2@@...@@context_an denotes n pieces of context in which entity_a is mentioned, with @@ separating the pieces.
    • label is a binary value indicating whether the two entities are synonyms.
  • The triple_contexts.txt file contains entities and contexts for the triplet architecture. Each line has six columns, separated by \t: entity_a \t entity_pos \t entity_neg \t context_a1@@context_a2@@...@@context_an \t context_pos_1@@context_pos_2@@...@@context_pos_p \t context_neg_1@@context_neg_2@@...@@context_neg_q,
    where entity_a denotes an entity, entity_pos denotes a synonym of entity_a, and entity_neg denotes a negative sample for entity_a.

  • *-vec200-mincount5-win5.bin is a binary file that stores the pre-trained word embeddings trained on the corpus of the dataset.

  • in_vocab is a vocabulary file generated automatically during training.
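
To make the column layout concrete, below is a minimal parsing sketch for both file formats. It assumes only the \t and @@ separators described above; the function and dictionary key names are illustrative and not part of the SynonymNet codebase.

def parse_siamese_line(line):
    """Split one line of siamese_contexts.txt into its five tab-separated columns."""
    entity_a, entity_b, contexts_a, contexts_b, label = line.rstrip("\n").split("\t")
    return {
        "entity_a": entity_a,
        "entity_b": entity_b,
        "contexts_a": contexts_a.split("@@"),  # n context pieces mentioning entity_a
        "contexts_b": contexts_b.split("@@"),
        "label": int(label),  # 1 = synonyms, 0 = not synonyms
    }

def parse_triple_line(line):
    """Split one line of triple_contexts.txt into its six tab-separated columns."""
    entity_a, entity_pos, entity_neg, ctx_a, ctx_pos, ctx_neg = line.rstrip("\n").split("\t")
    return {
        "entity_a": entity_a,
        "entity_pos": entity_pos,  # a synonym of entity_a
        "entity_neg": entity_neg,  # a negative sample for entity_a
        "contexts_a": ctx_a.split("@@"),
        "contexts_pos": ctx_pos.split("@@"),
        "contexts_neg": ctx_neg.split("@@"),
    }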

Download

Pre-trained word vectors and datasets can be downloaded here:

Dataset                Link
Wikipedia + Freebase   https://drive.google.com/open?id=1uX4KU6ws9xIIJjfpH2He-Yl5sPLYV0ws
PubMed + UMLS          https://drive.google.com/open?id=1cWHhXVd_Pb4N3EFdpvn4Clk6HVeWKVfF
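
To sanity-check the downloaded vectors before training, here is a small inspection sketch using gensim (pinned at 3.8.1 in requirements.txt). It assumes the .bin files are in the word2vec C binary format, which the file naming and the gensim dependency suggest but the repository does not state explicitly.

# Inspect the pre-trained vectors; paths follow the tree in the Format section.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "./input_data/wiki/skipgram-vec200-mincount5-win5.bin", binary=True
)
print(vectors.vector_size)  # expected: 200, per the file name (vec200)
# Entity tokens such as united_states||m.01bqks|| may be in-vocabulary;
# adjust the query term to your data.
print(vectors.most_similar("united_states||m.01bqks||", topn=5))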

Work on your own data

Prepare and organize your dataset in a folder according to the format above, put it under ./input_data/, and use --dataset=foldername during training.

For example, if your dataset is in ./input_data/mydata, use the flag --dataset=mydata for train_triplet.py.
Your dataset should be separated into three sub-folders, which train_triplet.py and train_siamese.py expect to be named 'train', 'test', and 'valid' by default.
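
As a sketch of scaffolding such a dataset (the name mydata and the sample contexts are hypothetical; the folder and file names follow the Format section above):

# Scaffold the expected folder layout for a hypothetical dataset "mydata".
import os

root = "./input_data/mydata"
for split in ("train", "valid", "test"):
    os.makedirs(os.path.join(root, split), exist_ok=True)

# One illustrative siamese line: five tab-separated columns,
# with context pieces joined by @@.
line = "\t".join([
    "u.s._government||m.01bqks||",
    "united_states||m.01bqks||",
    "@@".join(["context piece a1", "context piece a2"]),
    "@@".join(["context piece b1", "context piece b2"]),
    "1",  # label: the two entities are synonyms
])
with open(os.path.join(root, "train", "siamese_contexts.txt"), "w") as f:
    f.write(line + "\n")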

Reference

@inproceedings{zhang2020entity,
  title={Entity Synonym Discovery via Multipiece Bilateral Context Matching},
  author={Zhang, Chenwei and Li, Yaliang and Du, Nan and Fan, Wei and Yu, Philip S},
  booktitle={Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI)},
  year={2020}
}
