
mhgrn's Introduction

Multi-Hop Graph Relation Networks (EMNLP 2020)

License: MIT

This is the repo of our EMNLP'20 paper:

Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering
Yanlin Feng*, Xinyue Chen*, Bill Yuchen Lin, Peifeng Wang, Jun Yan and Xiang Ren.
EMNLP 2020.
* = equal contribution

This repository also implements other graph encoding models for question answering (including vanilla LM finetuning).

  • RelationNet
  • R-GCN
  • KagNet
  • GConAttn
  • KVMem
  • MHGRN (or MultiGRN)

Each model supports the following text encoders:

  • LSTM
  • GPT
  • BERT
  • XLNet
  • RoBERTa

Resources

We provide preprocessed ConceptNet and pretrained entity embeddings for your own use. These resources are independent of the source code.

Note that the following resources can be downloaded here.

ConceptNet (5.6.0)

Description | Downloads | Notes
Entity Vocab | entity-vocab | one entity per line, space replaced by '_'
Relation Vocab | relation-vocab | one relation per line, merged
ConceptNet (CSV format) | conceptnet-5.6.0-csv | English tuples extracted from the full ConceptNet with merged relations
ConceptNet (NetworkX format) | conceptnet-5.6.0-networkx | NetworkX pickled format, pruned by filtering out stop words

Entity Embeddings (Node Features)

Entity embeddings are packed into a matrix of shape (#ent, dim) and stored in numpy format. Use np.load to read the file. You may need to download the vocabulary files first.
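
A minimal sketch of pairing the vocabulary with the embedding matrix. It assumes that line i of the vocabulary file corresponds to row i of the matrix, which is not stated above, so verify it against the downloaded files. `load_entity_embeddings` is a hypothetical helper (not part of this repository), and the demo uses small synthetic files rather than the real resources:

```python
import os
import tempfile

import numpy as np


def load_entity_embeddings(vocab_path, emb_path):
    """Return (entity -> row index, embedding matrix).

    Assumes line i of the vocabulary corresponds to row i of the matrix.
    """
    with open(vocab_path, encoding="utf-8") as f:
        entities = [line.strip() for line in f if line.strip()]
    emb = np.load(emb_path)  # shape: (#ent, dim)
    if emb.shape[0] != len(entities):
        raise ValueError(
            f"vocab has {len(entities)} entries but matrix has {emb.shape[0]} rows"
        )
    return {e: i for i, e in enumerate(entities)}, emb


# Tiny self-contained demo with synthetic files (not the real resources).
tmp_dir = tempfile.mkdtemp()
demo_vocab = os.path.join(tmp_dir, "entity-vocab.txt")
demo_emb = os.path.join(tmp_dir, "entities.npy")
with open(demo_vocab, "w", encoding="utf-8") as f:
    f.write("ice_cream\ndog\n")
np.save(demo_emb, np.zeros((2, 100), dtype=np.float32))

ent2id, emb = load_entity_embeddings(demo_vocab, demo_emb)
vector = emb[ent2id["ice_cream"]]  # multi-word entities use '_' for spaces
print(vector.shape)
```

The shape check catches the common mistake of combining a vocabulary and a matrix that come from different downloads.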

Embedding Model | Dimensionality | Description | Downloads
TransE | 100 | Obtained using OpenKE with optim=sgd, lr=1e-3, epoch=1000 | entities, relations
NumberBatch | 300 | https://github.com/commonsense/conceptnet-numberbatch | entities
BERT-based | 1024 | Provided by Zhengwei | entities

Dependencies

Run the following commands to create a conda environment (assuming CUDA 10):

conda create -n krqa python=3.6 numpy matplotlib ipython
source activate krqa
conda install pytorch=1.1.0 torchvision cudatoolkit=10.0 -c pytorch
pip install dgl-cu100==0.3.1
pip install transformers==2.0.0 tqdm networkx==2.3 nltk spacy==2.1.6
python -m spacy download en

Usage

1. Download Data

First, you need to download all the necessary data in order to train the model:

git clone https://github.com/INK-USC/MHGRN.git
cd MHGRN
bash scripts/download.sh

The script will:

2. Preprocess

To preprocess the data, run:

python preprocess.py

By default, all available CPU cores will be used for multi-processing in order to speed up the process. Alternatively, you can use "-p" to specify the number of processes to use:

python preprocess.py -p 20
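
As a rough illustration of how such a flag can default to all CPU cores, here is a minimal argparse sketch. It is not the repository's parser: only "-p" appears above, and the long name `--nprocs` and the function `build_parser` are hypothetical.

```python
import argparse
import multiprocessing


def build_parser():
    parser = argparse.ArgumentParser(description="preprocessing sketch")
    # "--nprocs" is a hypothetical long name; only "-p" appears in the README.
    parser.add_argument(
        "-p", "--nprocs", type=int,
        default=multiprocessing.cpu_count(),
        help="number of worker processes (defaults to all CPU cores)",
    )
    return parser


default_args = build_parser().parse_args([])
custom_args = build_parser().parse_args(["-p", "20"])
print(default_args.nprocs, custom_args.nprocs)
```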

The script will:

  • Convert the original datasets into .jsonl files (stored in data/csqa/statement/)
  • Extract English relations from ConceptNet and merge the original 42 relation types into 17 types
  • Identify all mentioned concepts in the questions and answers
  • Extract subgraphs for each q-a pair
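
The relation-merging step can be pictured with the sketch below. The relation names are real ConceptNet relations, but the grouping shown is an assumption made for illustration; the actual 42-to-17 mapping lives in the repository's preprocessing code. `MERGE_GROUPS` and `merge_relation` are hypothetical names.

```python
# Illustrative grouping only: the relation names are real ConceptNet relations,
# but which ones the repository merges together is an assumption.
MERGE_GROUPS = {
    "atlocation": ["AtLocation", "LocatedNear"],
    "causes": ["Causes", "CausesDesire"],
    "isa": ["IsA", "InstanceOf", "DefinedAs"],
}

# Invert the grouping into a lookup from raw relation to merged relation.
RAW_TO_MERGED = {
    raw.lower(): merged for merged, raws in MERGE_GROUPS.items() for raw in raws
}


def merge_relation(raw_relation):
    """Map a raw relation such as '/r/LocatedNear' to its merged type."""
    name = raw_relation.rsplit("/", 1)[-1].lower()
    # Relations outside the table keep their own (lower-cased) name.
    return RAW_TO_MERGED.get(name, name)


triples = [
    ("/r/LocatedNear", "chair", "table"),
    ("/r/IsA", "dog", "animal"),
    ("/r/UsedFor", "pen", "writing"),
]
merged = [(merge_relation(r), h, t) for r, h, t in triples]
print(merged)
```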

The preprocessing procedure takes approximately 3 hours on a 40-core CPU server. Most intermediate files are in .jsonl or .pk format and stored in various folders. The resulting file structure will look like:

.
├── README.md
└── data/
    ├── cpnet/                 (preprocessed ConceptNet)
    ├── glove/                 (pretrained GloVe embeddings)
    ├── transe/                (pretrained TransE embeddings)
    └── csqa/
        ├── train_rand_split.jsonl
        ├── dev_rand_split.jsonl
        ├── test_rand_split_no_answers.jsonl
        ├── statement/             (converted statements)
        ├── grounded/              (grounded entities)
        ├── paths/                 (unpruned/pruned paths)
        ├── graphs/                (extracted subgraphs)
        ├── ...

3. Hyperparameter Search (optional)

To search the parameters for RoBERTa-Large on CommonsenseQA:

bash scripts/param_search_lm.sh csqa roberta-large

To search the parameters for BERT+RelationNet on CommonsenseQA:

bash scripts/param_search_rn.sh csqa bert-large-uncased

4. Training

Each graph encoding model is implemented in a single script:

Graph Encoder | Script | Description
None | lm.py | w/o knowledge graph
Relation Network | rn.py |
R-GCN | rgcn.py | Use --gnn_layer_num and --num_basis to specify #layer and #basis
KagNet | kagnet.py | Adapted from https://github.com/INK-USC/KagNet, still tuning
Gcon-Attn | gconattn.py |
KV-Memory | kvmem.py |
MHGRN | grn.py |

Some important command line arguments are listed as follows (run python {lm,rn,rgcn,...}.py -h for a complete list):

Arg | Values | Description | Notes
--mode | {train, eval, ...} | Training or Evaluation | default=train
-enc, --encoder | {lstm, openai-gpt, bert-large-uncased, roberta-large, ...} | Text Encoder | Model names (except for lstm) are the ones used by huggingface-transformers, default=bert-large-uncased
--optim | {adam, adamw, radam} | Optimizer | default=radam
-ds, --dataset | {csqa, obqa} | Dataset | default=csqa
-ih, --inhouse | {0, 1} | Run In-house Split | default=1, only applicable to CSQA
--ent_emb | {transe, numberbatch, tzw} | Entity Embeddings | default=tzw (BERT-based node features)
-sl, --max_seq_len | {32, 64, 128, 256} | Maximum Sequence Length | Use 128 or 256 for datasets that contain long sentences! default=64
-elr, --encoder_lr | {1e-5, 2e-5, 3e-5, 6e-5, 1e-4} | Text Encoder LR | dataset specific and text encoder specific, default values in utils/parser_utils.py
-dlr, --decoder_lr | {1e-4, 3e-4, 1e-3, 3e-3} | Graph Encoder LR | dataset specific and model specific, default values in {model}.py
--lr_schedule | {fixed, warmup_linear, warmup_constant} | Learning Rate Schedule | default=fixed
-me, --max_epochs_before_stop | {2, 4, 6} | Early Stopping Patience | default=2
--unfreeze_epoch | {0, 3} | Freeze Text Encoder for N epochs | model specific
-bs, --batch_size | {16, 32, 64} | Batch Size | default=32
--save_dir | str | Checkpoint Directory | model specific
--seed | {0, 1, 2, 3} | Random Seed | default=0
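
To make the early stopping patience (-me, --max_epochs_before_stop) concrete, here is a minimal sketch. It assumes that training stops once the given number of epochs has passed without a new best dev accuracy; the exact condition used by the training scripts may differ, and `train_with_early_stopping` is a hypothetical function, not one from this repository.

```python
def train_with_early_stopping(dev_accuracies, patience=2):
    """Return (best_epoch, best_acc, epochs_run) for per-epoch dev accuracies.

    Stops once `patience` epochs have passed since the last new best.
    """
    best_acc, best_epoch = float("-inf"), -1
    epochs_run = 0
    for epoch, acc in enumerate(dev_accuracies):
        epochs_run = epoch + 1
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_acc, epochs_run


# Synthetic accuracies: the late improvement at epoch 4 is never reached.
history = [0.61, 0.66, 0.65, 0.64, 0.70]
best_epoch, best_acc, epochs_run = train_with_early_stopping(history, patience=2)
print(best_epoch, best_acc, epochs_run)
```

With a patience of 2, the run above ends after four epochs, which shows the trade-off: a small patience saves compute but can miss late improvements.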

For example, run the following command to train a RoBERTa-Large model on CommonsenseQA:

python lm.py --encoder roberta-large --dataset csqa

To train a RelationNet with BERT-Large-Uncased as the encoder:

python rn.py --encoder bert-large-uncased

To reproduce the reported results of MultiGRN on the official CommonsenseQA set:

bash scripts/run_grn_csqa.sh

5. Evaluation

To evaluate a trained model (you need to specify --save_dir if the checkpoint is not stored in the default directory):

python {lm,rn,rgcn,...}.py --mode eval [ --save_dir path/to/directory/ ]

Use Your Own Dataset

  • Convert your dataset to {train,dev,test}.statement.jsonl in .jsonl format (see data/csqa/statement/train.statement.jsonl)
  • Create a directory in data/{yourdataset}/ to store the .jsonl files
  • Modify preprocess.py and perform subgraph extraction for your data
  • Modify utils/parser_utils.py to support your own dataset
  • Tune encoder_lr, decoder_lr, and other important hyperparameters, then modify utils/parser_utils.py and {model}.py to record the tuned values
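
The first step (writing a .jsonl file) could look roughly like the sketch below. The field names and the way a statement is formed (question followed by the answer choice) are assumptions for illustration only; mirror the structure of data/csqa/statement/train.statement.jsonl in practice. `to_statement_record` and `write_jsonl` are hypothetical helpers.

```python
import io
import json


def to_statement_record(qid, question, choices, answer_index):
    """Build one record with a candidate statement per answer choice.

    Field names are hypothetical; mirror the reference file in practice.
    """
    return {
        "id": qid,
        "question": question,
        "statements": [
            {"label": i == answer_index, "statement": f"{question} {choice}"}
            for i, choice in enumerate(choices)
        ],
    }


def write_jsonl(records, stream):
    """Write one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


buffer = io.StringIO()
write_jsonl(
    [to_statement_record("q1", "Where would you find a dog?", ["kennel", "ocean"], 0)],
    buffer,
)
lines = buffer.getvalue().splitlines()
first = json.loads(lines[0])
print(first["statements"][0])
```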

mhgrn's People

Contributors

aarzchan, shanzhenren, yanlinf, yuchenlin


mhgrn's Issues

Bert embeddings missing

Thanks for the repo! I noticed that the train.bert.large.layer-2.epoch1.npy and dev.bert.large.layer-2.epoch1.npy are still missing. When I use wget, it returns 403 forbidden. I think the links are expired.

modeling/model_kagnet.py has an error

Traceback (most recent call last):
  File "kagnet.py", line 283, in <module>
    main()
  File "kagnet.py", line 85, in main
    train(args)
  File "kagnet.py", line 220, in train
    logits, _ = model(*[x for x in input_data], layer_id=args.encoder_layer)
  File "/root/anaconda3/envs/krqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dev/MHGRN/modeling/modeling_kagnet.py", line 416, in forward
    *inputs = [x.view(x.size(0) * x.size(1), x.size()[2:]) for x in inputs[:-7]] + inputs[-7:] # merge the batch dimension and the num_choice dimension
TypeError: can only concatenate list (not "tuple") to list
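
The error message itself points at a general Python behavior: positional arguments collected with `*inputs` arrive as a tuple, and a list cannot be concatenated with a tuple slice. The sketch below reproduces this with plain shape tuples instead of tensors and a tail of 2 items instead of 7. The function names are hypothetical, and converting the slice with `list(...)` is one possible fix, not necessarily the one adopted in the repository.

```python
def merge_leading_dims(shape):
    """Stand-in for x.view(...): merge the first two dimensions of a shape tuple."""
    return (shape[0] * shape[1],) + tuple(shape[2:])


def forward_broken(*inputs):
    # `inputs` is a tuple, so inputs[-2:] is a tuple and list + tuple raises TypeError.
    return [merge_leading_dims(x) for x in inputs[:-2]] + inputs[-2:]


def forward_fixed(*inputs):
    # Converting the tail slice to a list makes both operands lists.
    return [merge_leading_dims(x) for x in inputs[:-2]] + list(inputs[-2:])


shapes = ((4, 5, 16), (4, 5, 8), "adj", "meta")

try:
    forward_broken(*shapes)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

fixed = forward_fixed(*shapes)
print(error_message)
print(fixed)
```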

The concept2id index in GconAttn are shifted by one

Hey, thanks for providing the code.
I am running one of the baseline models, GconAttn, and I am really confused about the concept2id part of the code.
In lines 155 and 156 of modelling_gconattn.py, 1 is added to the id of each concept. The comment says to leave index 0 for padding. What confuses me is that after adding 1, the index of a concept changes, so the embedding retrieved from the embedding matrix will no longer be the correct one. Did I miss something, or is my understanding correct? Thank you so much for your kind help.

for data in tqdm(concept_data, total=n, desc='loading concepts'):
    # leave index 0 for padding
    cur_qc = [self.concept2id[x] + 1 for x in data['qc']][:self.max_cpt_num]
    cur_ac = [self.concept2id[x] + 1 for x in data['ac']][:self.max_cpt_num]
    qc.append(cur_qc + [0] * (self.max_cpt_num - len(cur_qc)))
    ac.append(cur_ac + [0] * (self.max_cpt_num - len(cur_ac)))
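
Whether the shift is harmless depends on how the embedding table is built. The sketch below shows the one arrangement under which shifted ids stay correct: an extra all-zero padding row is prepended to the pretrained matrix, so row `id + 1` of the table holds the vector that was at row `id`. This is an assumption used for illustration; it has not been checked against the repository's embedding layer. The vocabulary and vectors are synthetic.

```python
import numpy as np

# Hypothetical 3-concept vocabulary with 4-dimensional embeddings.
concept2id = {"dog": 0, "kennel": 1, "bone": 2}
pretrained = np.arange(12, dtype=np.float32).reshape(3, 4)

# Prepend an all-zero row so that index 0 means "padding".
table = np.concatenate([np.zeros((1, 4), dtype=np.float32), pretrained], axis=0)

max_cpt_num = 4
cur_qc = [concept2id[c] + 1 for c in ["dog", "bone"]][:max_cpt_num]
padded = cur_qc + [0] * (max_cpt_num - len(cur_qc))  # [1, 3, 0, 0]

looked_up = table[padded]  # shape: (4, 4)
print(padded)
print(looked_up)
```

If the table were used without the extra row, the same shifted ids would indeed retrieve the wrong vectors, which is the concern raised in this issue.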

Not able to run kagnet or grn

I have followed all the steps and ran preprocess.py too, but I am not able to train kagnet or grn. I am getting the following error:
"FileNotFoundError: [Errno 2] No such file or directory: './data/cpnet/tzw.ent.npy'". Can someone please help me train kagnet?

Using LSTM as text encoder on CSQA achieves low performance

Hi, thanks for sharing this project.

I am wondering whether you have ever run the experiment on CSQA using the Bi-LSTM as the sentence encoder.
I ran the experiment but got very low accuracy on the in-house split: IHdev=22.5, IHtest=20.7. Does it always perform this poorly, or are there any parameters that I need to set specifically?

KagNet (Table 3) reports BLSTM results of IHdev=34.79, IHtest=32.12, much higher than the results I got.
Is there any difference between the BiLSTM used in this code and the KagNet code? For example, are the question and answer encoded separately or concatenated into one sequence? Is the final QA representation obtained by max pooling or mean pooling?

Thank you so much for your help.

resource missing

Can you provide the resource of pretrained TransE embeddings and BERT features? The original links are invalid.

Requirements for `transformers` incorrect

According to the environment creation that is outlined, transformers==2.0.0. If this is done, then any of the models containing the line

from transformers import (OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, 
                          BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, 
                          XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
                          ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
                          ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP)

will fail upon trying to import ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP because it wasn't implemented in transformers until at least transformers==2.2.0

It appears that 2.2.0 makes it past the import, but I'm curious if you know whether this will impact your models and which versions of transformers will work.
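
One way to keep the other encoders importable on a version that lacks the ALBERT constant is a guarded import, sketched below. This only addresses the import failure described above; it says nothing about which transformers versions the models actually work with. The fallback to an empty dict and the `albert_supported` flag are illustrative choices, not taken from the repository.

```python
try:
    from transformers import ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP
except ImportError:
    # Raised when the constant (or the package itself) is unavailable.
    ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP = {}

# Downstream code can check this flag before offering ALBERT as an encoder.
albert_supported = len(ALBERT_PRETRAINED_MODEL_ARCHIVE_MAP) > 0
print(albert_supported)
```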

./data/cpnet/tzw.ent.npy

Hi, I noticed that the node features are described in the paper as:

For the input node features, we first use templates to turn knowledge triples in ConceptNet into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity in ConceptNet, we perform mean pooling over the tokens of the entity’s occurrences across all the sentences to form a 1024d vector as its corresponding node feature. We use this set of features for all our implemented models.

Since the pre-trained model used in the paper is RoBERTa, why use BERT instead of RoBERTa to encode nodes?
Is there any open source code for the method of obtaining node features used in the paper? I want to try some changes on it. Thanks!
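
The pooling step in the quoted passage can be sketched as follows. This is not the authors' code: it assumes the token embeddings and the entity's token spans are already available, it pools over all tokens of all occurrences at once (the passage does not say whether occurrences are averaged individually first), and it uses synthetic 4-dimensional arrays in place of 1024-dimensional BERT outputs. `mean_pool_entity` is a hypothetical helper.

```python
import numpy as np


def mean_pool_entity(token_embeddings, spans):
    """Average token vectors over every occurrence of one entity.

    token_embeddings: list of (seq_len, dim) arrays, one per sentence.
    spans: list of (sentence_index, start, end) token ranges, end exclusive.
    """
    vectors = [token_embeddings[s][start:end] for s, start, end in spans]
    return np.concatenate(vectors, axis=0).mean(axis=0)


# Synthetic stand-ins for BERT outputs (dim=4 here instead of 1024).
sent_a = np.ones((5, 4), dtype=np.float32)
sent_b = np.full((6, 4), 3.0, dtype=np.float32)

# The entity covers tokens 1-2 in sentence 0 and token 4 in sentence 1.
feature = mean_pool_entity([sent_a, sent_b], [(0, 1, 3), (1, 4, 5)])
print(feature.shape)
```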

No such file or directory: './data/cpnet/tzw.ent.npy'

When I run rn.py, kagnet.py, gconattn.py, or grn.py, it shows No such file or directory: './data/cpnet/tzw.ent.npy'. My environment is the recommended environment.

So where can I get the directory? Thank you!
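
A small pre-flight check can report missing resources before a training script is launched. The sketch below uses only the path named in this issue; `missing_files` is a hypothetical helper, and the check does not download anything.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


required = ["./data/cpnet/tzw.ent.npy"]
missing = missing_files(required)
if missing:
    print("Missing files:", ", ".join(missing))
```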

TypeError: expected str, bytes or os.PathLike object, not NoneType

Thanks for sharing the source code. When I run preprocess.py, an error appears. Could you please tell me how to solve the problem?
Traceback (most recent call last):
  File "D:/workspace/MHGRN/preprocess.py", line 611, in <module>
    main()
  File "D:/workspace/MHGRN/preprocess.py", line 605, in main
    rt_dic'func'
  File "D:\workspace\MHGRN\utils\grounding.py", line 327, in ground
    res = match_mentioned_concepts(sents, answers, num_processes)
  File "D:\workspace\MHGRN\utils\grounding.py", line 239, in match_mentioned_concepts
    res = list(tqdm(p.imap(ground_qa_pair, zip(sents, answers)), total=len(sents)))
  File "D:\Anaconda\Anaconda\envs\krqa\lib\site-packages\tqdm\std.py", line 1167, in __iter__
    for obj in iterable:
  File "D:\Anaconda\Anaconda\envs\krqa\lib\multiprocessing\pool.py", line 735, in next
    raise value
TypeError: expected str, bytes or os.PathLike object, not NoneType

Evaluation does not work for RelationNet

The evaluation function was not implemented in the code, so I implemented it, but I get a CUDA error which I traced back to transformers. I am attaching a screenshot of the error. For context, I trained RelationNet using RoBERTa as the encoder, and when I try to evaluate the trained model, I get this error.

How to know other node's text information?

Hello,
Thank you for providing such an excellent paper.
I am a student with a lot of interest in this field. I know that the CommonsenseQA dataset has 5 correct labels per question, and accordingly, 5 subgraphs occur.
Here, each subgraph has 200 nodes.
There are some types of nodes: question, answer, and other nodes. What is the way to know the text information of other nodes?
Thank you!
