

License: MIT License

Language: Python (100%)
Topics: nlp-machine-learning, emnlp2020, biomedical, information-extraction, probabilistic-models, representation-learning


RELVM

This repository contains the code accompanying the paper "Learning Informative Representations of Biomedical Relations with Latent Variable Models", Harshil Shah and Julien Fauqueur, EMNLP SustaiNLP 2020 (https://arxiv.org/abs/2011.10285).

Requirements

  • Python 3.7
    • NumPy >= 1.17.2
    • TensorFlow >= 2.0.0

Instructions

Introduction

The code in this repository is for training a latent variable generative model of pairs of entities and the contexts (i.e. sentences) in which the entities occur. The representations from this model can then be used to perform both mention-level and pair-level classification.

Throughout the code, the following conventions are used:

  • x or entities_x will refer to the first entity in a context.
  • y or entities_y will refer to the second entity in a context.
  • c or contexts will refer to the context (i.e. sentence) in which the entities occur.
  • r or labels will refer to the class label when performing either mention-level or pair-level classification.

Data

To avoid out-of-memory issues, the data is stored in memory-mapped NumPy arrays; the metadata is stored in JSON files.
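As a sketch of this convention (the array shapes, dtype, and vocabulary below are illustrative assumptions, not taken from the repository):

```python
import json
import os
import tempfile

import numpy as np

data_dir = tempfile.mkdtemp()  # stand-in for e.g. data/unsupervised

# Metadata lives in a JSON file...
vocab = ["<pad>", "protein", "inhibits", "kinase"]
with open(os.path.join(data_dir, "vocab.json"), "w") as f:
    json.dump(vocab, f)

# ...while the data itself is a memory-mapped array on disk.
num_contexts, max_len = 4, 8
contexts = np.memmap(os.path.join(data_dir, "contexts.mmap"),
                     dtype=np.int32, mode="w+", shape=(num_contexts, max_len))
contexts[:] = 0
contexts[0, :3] = [1, 2, 3]  # token indices into vocab.json
contexts.flush()

# Reading back only pages in the rows that are touched, so the full
# array never has to fit in RAM.
contexts_read = np.memmap(os.path.join(data_dir, "contexts.mmap"),
                          dtype=np.int32, mode="r", shape=(num_contexts, max_len))
print([vocab[int(i)] for i in contexts_read[0, :3]])  # ['protein', 'inhibits', 'kinase']
```

Note that np.memmap does not store the shape or dtype in the file, which is why the accompanying JSON metadata (and the reader's knowledge of the array dimensions) is needed to interpret the .mmap files.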

Unsupervised

The unsupervised data directory (e.g. data/unsupervised) should contain the following JSON files (which contain the metadata):

  • vocab.json
    • This is a list of strings. It is the set of possible tokens which can appear in the contexts.
  • entity_types.json
    • This is a list of strings. It is the set of possible values for the entities. If training the model for mention-level classification, this will be a list of entity types. If training the model for pair-level classification, this will be a list of entity identifiers.

It should also contain the following memory-mapped Numpy arrays:

  • entities_x.mmap
    • This is a one-dimensional array. The ith element contains the index into entity_types.json for the first entity in the ith context.
  • entities_y.mmap
    • This is a one-dimensional array. The ith element contains the index into entity_types.json for the second entity in the ith context.
  • contexts.mmap
    • This is a two-dimensional array. The ith row contains the indices into vocab.json for the ith context.
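A minimal sketch of writing one of the entity arrays (the entity type names and sizes are invented for illustration):

```python
import json
import os
import tempfile

import numpy as np

data_dir = tempfile.mkdtemp()  # stand-in for data/unsupervised

entity_types = ["GENE", "DISEASE", "CHEMICAL"]  # illustrative values
with open(os.path.join(data_dir, "entity_types.json"), "w") as f:
    json.dump(entity_types, f)

num_contexts = 3
entities_x = np.memmap(os.path.join(data_dir, "entities_x.mmap"),
                       dtype=np.int32, mode="w+", shape=(num_contexts,))
entities_x[:] = [0, 2, 0]  # first entity of each context
entities_x.flush()

# Recover the type of the first entity in context 1:
print(entity_types[int(entities_x[1])])  # CHEMICAL
```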

Mention-level classification

The mention-level classification data directory (e.g. data/supervised/mention_level) should contain the following JSON files (which contain the metadata):

  • label_types.json
    • This is a list of strings. It is the set of possible values for the labels.

It should also contain the following memory-mapped Numpy arrays (for the training, validation, and test data respectively):

  • entities_x_{train,valid,test}.mmap
    • These are one-dimensional arrays. The ith element contains the index into entity_types.json (from the unsupervised data directory) for the first entity in the ith context.
  • entities_y_{train,valid,test}.mmap
    • These are one-dimensional arrays. The ith element contains the index into entity_types.json (from the unsupervised data directory) for the second entity in the ith context.
  • contexts_{train,valid,test}.mmap
    • These are two-dimensional arrays. The ith row contains the indices into vocab.json (from the unsupervised data directory) for the ith context.
  • labels_{train,valid,test}.mmap
    • These are one-dimensional arrays. The ith element contains the index into label_types.json for the ith entity pair and context.
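The key invariant is that row i of every *_train.mmap file describes the same training example. A toy sketch of the labels file (the label names are invented for illustration):

```python
import json
import os
import tempfile

import numpy as np

data_dir = tempfile.mkdtemp()  # stand-in for data/supervised/mention_level

label_types = ["NO_RELATION", "UPREGULATES", "DOWNREGULATES"]  # illustrative
with open(os.path.join(data_dir, "label_types.json"), "w") as f:
    json.dump(label_types, f)

num_examples = 3
labels = np.memmap(os.path.join(data_dir, "labels_train.mmap"),
                   dtype=np.int32, mode="w+", shape=(num_examples,))
labels[:] = [1, 0, 2]
labels.flush()

# The label of training example 0 (which is described by row 0 of
# entities_x_train.mmap, entities_y_train.mmap, and contexts_train.mmap):
print(label_types[int(labels[0])])  # UPREGULATES
```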

Pair-level classification

The pair-level classification data directory (e.g. data/supervised/pair_level) should contain the following JSON files (which contain the metadata):

  • label_types.json
    • This is a list of strings. It is the set of possible values for the labels.
  • pos_entity_pairs.json
    • This is a list of strings containing the entity pairs which have a positive relation. Each element is the pair's two unique entity identifiers joined with a colon (:).

It should also contain the following memory-mapped Numpy arrays (for the training, validation, and test data respectively):

  • entities_x_{train,valid,test}.mmap
    • These are one-dimensional arrays. The ith element contains the index into entity_types.json (from the unsupervised data directory) for the first entity in the ith pair.
  • entities_y_{train,valid,test}.mmap
    • These are one-dimensional arrays. The ith element contains the index into entity_types.json (from the unsupervised data directory) for the second entity in the ith pair.
  • labels_{train,valid,test}.mmap
    • These are one-dimensional arrays. The ith element contains the index into label_types.json for the ith entity pair.
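The pos_entity_pairs.json format can be sketched as follows. The identifiers here are made up; the README only specifies that each positive pair is the two unique entity identifiers joined with a colon:

```python
import json
import os
import tempfile

data_dir = tempfile.mkdtemp()  # stand-in for data/supervised/pair_level

# Hypothetical unique entity identifiers for the positively-related pairs.
positive_pairs = [("ENT00001", "ENT00042"), ("ENT00007", "ENT00013")]
pos_entity_pairs = [f"{x}:{y}" for x, y in positive_pairs]

with open(os.path.join(data_dir, "pos_entity_pairs.json"), "w") as f:
    json.dump(pos_entity_pairs, f)

print(pos_entity_pairs)  # ['ENT00001:ENT00042', 'ENT00007:ENT00013']
```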

Training and evaluating

Unsupervised representation learning

To train the unsupervised representation model, run exp_unsup.py, specifying the directory in which to store the model parameters. For example:

mkdir -p exp_outputs/unsup

python3 exp_unsup.py exp_outputs/unsup

Mention-level classification

Once the unsupervised representation model has been trained, the mention-level classification model can be trained and evaluated by running exp_classification_mention.py. First, the following variables in exp_classification_mention.py must be set:

  • trainer_unsup_pre_trained_dir must be set to the directory with the saved parameters from the unsupervised representation model (e.g. exp_outputs/unsup).
  • unsup_data_dir must be set to the directory with the data used to train the unsupervised representation model (e.g. data/unsupervised).
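Concretely, using the example paths above, the two assignments inside exp_classification_mention.py would look something like this (only the variable names come from the README; the surrounding code is not reproduced here):

```python
# Inside exp_classification_mention.py (variable names from the README):
trainer_unsup_pre_trained_dir = "exp_outputs/unsup"  # saved unsupervised model parameters
unsup_data_dir = "data/unsupervised"  # data used to train the unsupervised model
```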

When running exp_classification_mention.py, the directory in which to store the classification model parameters must be specified. For example:

mkdir -p exp_outputs/classification_mention

python3 exp_classification_mention.py exp_outputs/classification_mention

Pair-level classification

The pair-level classification model can be trained and evaluated by running exp_classification_pair.py in an identical fashion to the mention-level classification model. Again, the following variables in exp_classification_pair.py must be set:

  • trainer_unsup_pre_trained_dir must be set to the directory with the saved parameters from the unsupervised representation model (e.g. exp_outputs/unsup).
  • unsup_data_dir must be set to the directory with the data used to train the unsupervised representation model (e.g. data/unsupervised).

When running exp_classification_pair.py, the directory in which to store the classification model parameters must be specified. For example:

mkdir -p exp_outputs/classification_pair

python3 exp_classification_pair.py exp_outputs/classification_pair

Component testing

From the project root, run the following three scripts in this order; each should finish by printing OK.

python3 tests/test_unsup.py
python3 tests/test_classification_mention.py
python3 tests/test_classification_pair.py


relvm's Issues

Confusing parts in the entities_x definition

The docstring of the make_memmap code states:
"entities_x : np.memmap — The first entity in a sentence. The first column contains the index to the UUIDs; the second contains the index to the entity type."
This suggests that entities_x.mmap is a two-dimensional array with two columns, whereas the README describes it as one-dimensional. The code is also somewhat confusing: it defines the shape of mm as (num_data, 1) but then assigns with mm[:, 0] =.

Question on the unsupervised representation learning part

Large corpora such as PubMed abstracts are assumed to be tagged with entity types. Can you describe how this is done? If it is done by dictionary matching, how many sentences satisfy the requirements that 1) each sentence contains at least two entities, and 2) one entity is of gene type and the other of disease type?
