Bootleg

Self-Supervision for Named Entity Disambiguation at the Tail

Bootleg is a self-supervised named entity disambiguation (NED) system built to improve disambiguation of entities that occur infrequently, or not at all, in training data. We call these entities tail entities. This is a critical task as the majority of entities are rare. The core insight behind Bootleg is that these tail entities can be disambiguated by reasoning over entity types and relations. We give an overview of how Bootleg achieves this below. For details, please see our blog post and paper.

Note that Bootleg is actively under development and feedback is welcome. We have not done extensive tuning or parameter sweeps of our models and have mainly been focused on capturing the right inductive biases through the model and dataflow. Submit bugs on the Issues page, or contribute improvements as a pull request.

Getting Started

Installation

Bootleg requires Python 3.6 or later. We recommend using pip or conda to install.

If using pip:

pip install -r requirements.txt
python setup.py develop

If using conda:

conda env create --name <env_name> --file conda_requirements.yml
python setup.py develop

Note that the requirements assume CUDA 10.2. To use CUDA 10.1, you will need to run:

pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

or

conda install pytorch==1.5.0 torchvision==0.6.0 cudatoolkit=10.1 -c pytorch
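
After installing, one way to confirm that the installed PyTorch build matches your CUDA setup is the following check (this uses only standard PyTorch calls and simply reports whether torch sees a GPU):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"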

Models

We have six different Bootleg models you can download. Each download comes with the saved model and config to run the model. We show in our benchmark tutorial and end-to-end tutorial how to load a config and run a model.

| Model | Description | Number of Parameters | Link |
|-------|-------------|----------------------|------|
| Bootleg | All entity embeddings with type and KG embeddings. Has an additional title embedding, sentence co-occurrence feature, and page co-occurrence feature. | 1.38B | Download |
| BootlegMini | Top-5 entity embeddings with type and KG embeddings. Has an additional title embedding, sentence co-occurrence feature, and page co-occurrence feature. | 84M | Download |
| BootlegSimple | All entity embeddings with type and KG embeddings. | 1.37B | Download |
| BootlegSimpleMini | Top-5 entity embeddings with type and KG embeddings. | 82M | Download |
| BootlegType | Type embeddings. | 13M | Download |
| BootlegKG | KG embeddings. | 9M | Download |

Tutorials

We provide tutorials to help users get familiar with Bootleg here.

Bootleg Overview

Given an input sentence, Bootleg outputs a predicted entity for each detected mention. Bootleg first extracts mentions in the sentence by querying a pre-mined mention-to-candidate mapping (see extract_mentions.py). For each mention, we extract its set of possible candidate entities (done in prep.py) and any structural information about that entity, e.g., type information or knowledge graph (KG) information. The structural information is stored as embeddings in their associated embedding classes. Bootleg leverages these embeddings as entity payloads along with the sentence information as word embeddings to predict which entity (possibly the NIL entity) is associated with each mention.
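
Conceptually, the candidate mapping is a dictionary from mention strings to possible entities. The sketch below is purely illustrative; the names and in-memory format here are hypothetical, not Bootleg's actual pre-mined map (see extract_mentions.py for the real pipeline):

# Illustrative only: a toy mention-to-candidate map. Bootleg's real map is
# pre-mined from Wikipedia and stored in its own format.
candidate_map = {
    "lincoln": ["Abraham Lincoln", "Lincoln, Nebraska", "Lincoln Motor Company"],
    "jaguar": ["Jaguar (animal)", "Jaguar Cars"],
}

def get_candidates(mention: str, max_candidates: int = 30):
    """Return up to max_candidates possible entities for a detected mention."""
    return candidate_map.get(mention.lower(), [])[:max_candidates]

print(get_candidates("Lincoln"))  # ['Abraham Lincoln', 'Lincoln, Nebraska', ...]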

Dataflow

Entity Payload

We use three embeddings for the entity payloads. Each entity gets the following embeddings:

  • Entity: learned embedding
  • Type: learned embedding for each of its types
  • Relation: learned embedding for each relation it participates in on Wikidata

We also allow the use of other entity-based features. In our benchmark model, we use a title embedding and a Wikipedia page co-occurrence statistical feature.

These embeddings are concatenated and projected to form an entity payload.
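
As a rough sketch of that payload construction (the dimensions E, T, R, and H are illustrative stand-ins for Bootleg's configured sizes, and pooling an entity's multiple types and relations is simplified here to an average):

import torch
import torch.nn as nn

E, T, R, H = 256, 128, 128, 512  # entity/type/relation dims and hidden dim (illustrative)

class EntityPayload(nn.Module):
    """Concatenate entity, type, and relation embeddings, then project to H."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(E + T + R, H)

    def forward(self, ent_emb, type_embs, rel_embs):
        # ent_emb: (batch, E); type_embs: (batch, num_types, T); rel_embs: (batch, num_rels, R)
        type_pooled = type_embs.mean(dim=1)   # simplified pooling over an entity's types
        rel_pooled = rel_embs.mean(dim=1)     # simplified pooling over its relations
        payload = torch.cat([ent_emb, type_pooled, rel_pooled], dim=-1)
        return self.proj(payload)             # (batch, H)

# Example: 2 entities, each with 3 types and 50 relations.
out = EntityPayload()(torch.randn(2, E), torch.randn(2, 3, T), torch.randn(2, 50, R))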

Architecture

  • Input: contextualized word embeddings (e.g. BERT) and entity payloads
  • Network: uses transformer modules to learn patterns over phrases and entities
    • Phrase Module: attention over the sentence and entity payloads
    • Co-Occurrence Module: self-attention over the entity payloads
    • KG Module: takes the sum of the output of the phrase and co-occurrence modules and leverages KG connectivity among candidates as weights in an attention
  • Score: uses MLP softmax layers to score each mention and candidate independently, selecting the most likely candidate per mention

In the figure above, M represents the maximum number of mentions (or aliases) in the sentence, K represents the maximum number of candidates considered per mention, and N represents the maximum number of sub-words in the sentence. Typically, we use M=10, K=30, and N=100. Additionally, H is the hidden dimension used throughout the backbone, E is the dimension of the learned entity embedding, R the dimension of the learned relation embedding, and T the dimension of the learned type embedding. We further select an entity's 3 most popular types and 50 most unique Wikidata relations. These are all tunable parameters in Bootleg.
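
To make the shapes concrete, here is a hedged sketch of the tensors flowing into the scoring step, using the default sizes above. The scorer here is a stand-in MLP, not Bootleg's actual module:

import torch
import torch.nn as nn

M, K, H = 10, 30, 512  # max mentions, candidates per mention, hidden dim (H illustrative)
batch = 4

# Backbone output: one H-dim vector per (mention, candidate) pair.
candidate_reprs = torch.randn(batch, M, K, H)

# Score each candidate independently with an MLP, then softmax over the K candidates.
scorer = nn.Sequential(nn.Linear(H, H), nn.ReLU(), nn.Linear(H, 1))
scores = scorer(candidate_reprs).squeeze(-1)   # (batch, M, K)
probs = scores.softmax(dim=-1)                 # distribution over candidates per mention
predictions = probs.argmax(dim=-1)             # most likely candidate index, (batch, M)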

Inference

Given a pretrained model, we support three types of inference: --mode eval, --mode dump_preds, and --mode dump_embs. eval is the fastest option: it runs the test files through the model and writes aggregated quality metrics to the log. dump_preds writes the individual predictions and corresponding probabilities to a jsonlines file, which is useful for error analysis. dump_embs is the same as dump_preds, but additionally outputs contextual entity embeddings. These can then be read and processed in a downstream system.
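
Reading a jsonlines dump back is one JSON record per line. The field names below (sentence, qids, probs) are assumptions for illustration, not Bootleg's documented schema, so inspect your dump file first:

import json

with open("bootleg_labels.jsonl") as f:  # hypothetical output file name
    for line in f:
        record = json.loads(line)
        # Keys vary by Bootleg version; "qids"/"probs" here are illustrative.
        print(record.get("sentence"), record.get("qids"), record.get("probs"))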

Training

We recommend using GPUs for training Bootleg models. For large datasets, we support distributed training with PyTorch's DistributedDataParallel framework, which distributes batches across multiple GPUs. Check out the Basic Training and Advanced Training tutorials for more information and sample data!
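
For reference, the general DistributedDataParallel pattern looks like the skeleton below. This is a generic PyTorch sketch, not Bootleg's actual training loop, and it assumes it is run under a distributed launcher that sets the usual environment variables:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model: torch.nn.Module) -> DDP:
    """Initialize the process group and wrap a model for multi-GPU training."""
    dist.init_process_group(backend="nccl")            # one process per GPU
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # provided by the launcher
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    # Gradients are all-reduced across processes after each backward pass.
    return DDP(model, device_ids=[local_rank])

A script using this pattern is launched with, e.g., python -m torch.distributed.launch --nproc_per_node=<num_gpus> train.py (launcher flags vary by PyTorch version).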

Downstream Tasks

Bootleg produces contextual entity embeddings (as well as learned static embeddings) that can be used in downstream tasks, such as relation extraction and question answering. Check out the tutorial to see how this is done.
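n
As a sketch of the downstream pattern: the embeddings below are a random stand-in for what --mode dump_embs produces, since the real dump format is covered in the tutorial; the shapes and the classifier are illustrative assumptions:

import torch

# Stand-in for the contextual entity embeddings written by --mode dump_embs;
# in practice, load Bootleg's dumped file (see the tutorial for the real format).
num_mentions, hidden_dim = 100, 512
ent_embs = torch.randn(num_mentions, hidden_dim)

# A downstream model (e.g., relation extraction) can concatenate these
# embeddings with its own task features.
task_features = torch.randn(num_mentions, 128)     # stand-in task features
inputs = torch.cat([task_features, ent_embs], dim=-1)
classifier = torch.nn.Linear(inputs.shape[-1], 2)  # e.g., a binary relation label
logits = classifier(inputs)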
