
SDNet

This is the official code for Microsoft's submission of the SDNet model to the CoQA leaderboard. It is implemented in the PyTorch framework. The paper to cite is:

SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering, by Chenguang Zhu, Michael Zeng and Xuedong Huang, at https://arxiv.org/abs/1812.03593.

For usage of this code, please follow the Microsoft Open Source Code of Conduct.

Directory structure:

  • main.py: the starter code

  • Models/

    • BaseTrainer.py: Base class for trainer
    • SDNetTrainer.py: Trainer for SDNet, including training and predicting procedures
    • SDNet.py: The SDNet network structure
    • Layers.py: Related network layer functions
    • Bert/
      • Bert.py: Customized class to compute BERT contextualized embedding
      • modeling.py, optimization.py, tokenization.py: from Hugging Face's PyTorch implementation of BERT
  • Utils/

    • Arguments.py: Process argument configuration file
    • Constants.py: Define constants used
    • CoQAPreprocess.py: preprocesses CoQA raw data into intermediate binary/json files, including tokenization and history prepending (see the sketch after this list)
    • CoQAUtils.py, GeneralUtils.py: utility functions used in SDNet
    • Timing.py: Logging time
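For intuition, "history prepending" follows the paper's input construction: the question at turn t is prefixed with the previous rounds of questions and answers (the prev_ques/prev_ans settings visible in the training logs below, both 2 by default). The following is an illustrative sketch of that idea, not the actual code in CoQAPreprocess.py; the function name and the plain-space concatenation are assumptions.

    # Illustrative sketch of history prepending (not the repo's actual implementation).
    def prepend_history(questions, answers, turn, prev_rounds=2):
        """Prefix the question at `turn` with the previous `prev_rounds` QA pairs."""
        parts = []
        for t in range(max(0, turn - prev_rounds), turn):
            parts.extend([questions[t], answers[t]])  # interleave past Q and A
        parts.append(questions[turn])                 # current question comes last
        return " ".join(parts)

    # Example: turn 2 sees Q0 A0 Q1 A1 Q2 as its input question.
    qs = ["Who wrote it?", "Who is he?", "Where was he born?"]
    ans = ["Dickens", "An English novelist"]
    print(prepend_history(qs, ans, turn=2))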

How to run

Requirements: PyTorch 0.4.1 and spaCy 2.0.16. The Docker image we used is available on Docker Hub at https://hub.docker.com/r/zcgzcgzcg/squadv2/tags; please use v3.0 or v4.0.
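To confirm the environment matches, a minimal check (a sketch; the version strings are the ones stated above):

    # Minimal environment check against the versions stated above.
    import torch
    import spacy

    assert torch.__version__.startswith("0.4.1"), "expected PyTorch 0.4.1, got " + torch.__version__
    assert spacy.__version__ == "2.0.16", "expected spaCy 2.0.16, got " + spacy.__version__
    print("Environment OK")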

  1. Create a folder (e.g. coqa) to contain data and running logs;
  2. Create folder coqa/data to store CoQA raw data: coqa-train-v1.0.json and coqa-dev-v1.0.json;
  3. Copy the file conf from the repo into folder coqa;
  4. If you want to use BERT-Large, download the pretrained model into coqa/bert-large-uncased; if you want to use BERT-base, download it into coqa/bert-base-cased;
  5. Create a folder glove in the same parent directory as coqa and download the GloVe embeddings glove.840B.300d.txt into it.

Your directory should look like this:

  • coqa/
    • data/
      • coqa-train-v1.0.json
      • coqa-dev-v1.0.json
    • bert-large-uncased/
      • bert-large-uncased-vocab.txt
      • bert_config.json
      • pytorch_model.bin
    • conf
  • glove/
    • glove.840B.300d.txt
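Before launching training, a small sanity check can confirm this layout (a sketch; it assumes you run it from the parent directory containing coqa/ and glove/, with the BERT-Large variant):

    import os

    # Files required by the layout shown above (BERT-Large variant).
    required = [
        "coqa/data/coqa-train-v1.0.json",
        "coqa/data/coqa-dev-v1.0.json",
        "coqa/bert-large-uncased/bert-large-uncased-vocab.txt",
        "coqa/bert-large-uncased/bert_config.json",
        "coqa/bert-large-uncased/pytorch_model.bin",
        "coqa/conf",
        "glove/glove.840B.300d.txt",
    ]
    missing = [p for p in required if not os.path.exists(p)]
    print("Missing files: " + ", ".join(missing) if missing else "Layout OK")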

Then, execute python main.py train path_to_coqa/conf.

The first time you run the code, CoQAPreprocess.py will automatically create the folder conf~/spacy_intermediate_features~ inside coqa to store intermediate tokenization results; this takes a few hours.

Each time you run the code, a new folder run_idx is created inside coqa/conf~, containing the running logs, predictions on the dev set, and the best model.

Contact

If you have any questions, please contact Chenguang Zhu, [email protected]


SDNet's Issues

Inference/Prediction module

Is there a way to test the trained model on our own data, e.g., passing a passage and questions as input and getting back the answer span and its score?

Mapping BERT model contents

I have followed all the steps in the README. In the file structure given, the bert-large-uncased folder is listed as containing:

bert-large-uncased-vocab.txt
bert_config.json
pytorch_model.bin

but the actual model downloaded from the BERT repo contains the following:
vocab.txt --> bert-large-uncased-vocab.txt
bert_config.json --> bert_config.json
bert_model.ckpt.index
bert_model.ckpt.meta
bert_model.ckpt.data-00000-of-00001

I tried renaming each of the other three files to pytorch_model.bin, but each one gives a pickling error; only the key value displayed in the error changes.

Traceback (most recent call last):
  File "main.py", line 33, in <module>
    model.train()
  File "/home/crm-di/SDNet/Models/SDNetTrainer.py", line 66, in train
    self.setup_model(vocab_embedding)
  File "/home/crm-di/SDNet/Models/SDNetTrainer.py", line 137, in setup_model
    self.network = SDNet(self.opt, vocab_embedding)
  File "/home/crm-di/SDNet/Models/SDNet.py", line 63, in __init__
    self.Bert = Bert(self.opt)
  File "/home/crm-di/SDNet/Models/Bert/Bert.py", line 24, in __init__
    self.bert_model = BertModel.from_pretrained(model_file)
  File "/home/crm-di/SDNet/Models/Bert/modeling.py", line 505, in from_pretrained
    state_dict = torch.load(weights_path)
  File "/home/crm-di/SDNet/env/lib/python3.7/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/crm-di/SDNet/env/lib/python3.7/site-packages/torch/serialization.py", line 532, in _load
    magic_number = pickle_module.load(f)
_pickle.UnpicklingError: invalid load key, '\x0a'.

Could more details about the BERT models be added?
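Note on the traceback above: torch.load() expects a pickled PyTorch state dict, so pointing it at any of the TensorFlow bert_model.ckpt.* files raises exactly this UnpicklingError. Checkpoints from the Google BERT repo must first be converted. A sketch, assuming the old pytorch-pretrained-bert package (from which the bundled modeling.py derives) is installed; SDNet itself does not ship this step, and the import path should be verified against your installed version:

    # Hypothetical conversion of a TF checkpoint to the pytorch_model.bin SDNet expects.
    from pytorch_pretrained_bert.convert_tf_checkpoint_to_pytorch import (
        convert_tf_checkpoint_to_pytorch,
    )

    convert_tf_checkpoint_to_pytorch(
        "bert_model.ckpt",    # TF checkpoint prefix (covers the .index/.meta/.data files)
        "bert_config.json",   # config from the same download
        "pytorch_model.bin",  # output file SDNet loads
    )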

Download BERT models

Hi,

I have downloaded bert-large-uncased from "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz", which only gives me pytorch_model.bin and bert_config.json.
To get the missing bert-large-uncased-vocab.txt, I downloaded the model from https://github.com/google-research/bert and renamed vocab.txt to bert-large-uncased-vocab.txt.

I guess that was not the right solution, because I got this error:

Using BERT Large model
Loading tokenizer from ../coqa/bert-large-uncased/bert-large-uncased-vocab.txt
02/27/2019 14:02:02 - INFO - Models.Bert.tokenization -   loading vocabulary file ../coqa/bert-large-uncased/bert-large-uncased-vocab.txt
*****************
prev_ques   : 2
prev_ans    : 2
ques_max_len: 140
****************
/path/SDNet/Models/SDNet.py:286: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  alpha_softmax = F.softmax(alpha)
Traceback (most recent call last):
  File "main.py", line 33, in <module>
    model.train()
  File "/path/SDNet/Models/SDNetTrainer.py", line 126, in train
    self.update(batch)
  File "/path/SDNet/Models/SDNetTrainer.py", line 182, in update
    targets = torch.LongTensor(np.array(targets))
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: double, float, float16, int64, int32, and uint8.

Any idea where I can get the right data, or how to fix this?
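Note: one plausible mechanism for this TypeError (an assumption, not confirmed in the thread) is that the collected targets are ragged or contain None, so NumPy falls back to an object-dtype array, which torch.LongTensor cannot convert:

    import numpy as np

    # With the NumPy versions of that era, ragged input silently yields dtype=object,
    # the exact dtype torch.LongTensor rejects in the traceback above.
    a = np.array([[1, 2], [3]])
    print(a.dtype)  # object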

Train error with the BERT model in PyTorch 0.4.0 or 0.4.1

In PyTorch 0.4.1, when I train the model with BERT-Large, the following error appears at line 288 of Models/SDNet.py:
t = output[i] * alpha_softmax[i] * gamma

/da1/home/berton/project/SDNet/Models/SDNet.py:286: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
alpha_softmax = F.softmax(alpha)
Traceback (most recent call last):
  File "main.py", line 35, in <module>
    model.train()
  File "/da1/home/berton/project/SDNet/Models/SDNetTrainer.py", line 126, in train
    self.update(batch)
  File "/da1/home/berton/project/SDNet/Models/SDNetTrainer.py", line 165, in update
    query, query_mask, query_char, query_char_mask, query_bert, query_bert_mask, query_bert_offsets, len(context_words))
  File "/home/berton/py35/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/da1/home/berton/project/SDNet/Models/SDNet.py", line 192, in forward
    x_cemb_mid = self.linear_sum(x_bert_output, self.alphaBERT, self.gammaBERT)
  File "/da1/home/berton/project/SDNet/Models/SDNet.py", line 288, in linear_sum
    t = output[i] * alpha_softmax[i] * gamma
RuntimeError: sizes must be non-negative

However, I guess this is due to version issues (pytorch/pytorch#11478), so I switched to PyTorch 0.4.0, which shows a different error:

/da1/home/berton/project/SDNet/Models/SDNet.py:286: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
alpha_softmax = F.softmax(alpha)
Traceback (most recent call last):
  File "main.py", line 35, in <module>
    model.train()
  File "/da1/home/berton/project/SDNet/Models/SDNetTrainer.py", line 126, in train
    self.update(batch)
  File "/da1/home/berton/project/SDNet/Models/SDNetTrainer.py", line 165, in update
    query, query_mask, query_char, query_char_mask, query_bert, query_bert_mask, query_bert_offsets, len(context_words))
  File "/home/berton/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/da1/home/berton/project/SDNet/Models/SDNet.py", line 192, in forward
    x_cemb_mid = self.linear_sum(x_bert_output, self.alphaBERT, self.gammaBERT)
  File "/da1/home/berton/project/SDNet/Models/SDNet.py", line 292, in linear_sum
    res += t
RuntimeError: The expanded size of the tensor (1024) must match the existing size (388) at non-singleton dimension 2
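Note: the UserWarning itself is silenced by passing dim explicitly; whether that also resolves the RuntimeError depends on alpha's shape and is not confirmed here. A minimal sketch, assuming alpha is a 1-D vector of per-layer weights as the linear_sum context suggests:

    import torch
    import torch.nn.functional as F

    alpha = torch.zeros(25)                  # e.g., one weight per BERT layer (illustrative)
    alpha_softmax = F.softmax(alpha, dim=0)  # explicit dim avoids the deprecation warning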

Please add support for pytorch 1.0.0 and msgpack-numpy 0.4.4.1

I used msgpack-numpy==0.4.4.1, but the code fails when storing and loading the preprocessed file.

I used pytorch==1.0.0, but it fails when training starts.

So I had to downgrade to msgpack-numpy==0.4.3.2 and pytorch==0.4.1, and those worked!

I would be very grateful if you could add support for the newer versions, thanks~

Python 3.7 compatibility

async became a fully reserved keyword in Python 3.7, so it would be nice if you could update your code to reflect this change. Simply changing async=True in your code to non_blocking=True makes the code compatible with newer Python versions.
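For concreteness, the change looks like this (non_blocking has been the accepted keyword since PyTorch 0.4.0):

    import torch

    x = torch.randn(4, 4)
    if torch.cuda.is_available():
        # Old spelling, a SyntaxError on Python 3.7+ where async is reserved:
        #   x = x.cuda(async=True)
        x = x.cuda(non_blocking=True)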

Why does training with ELMo instead of BERT not improve results?

Hello,

I have tried to use ELMo instead of BERT, as you can see on my fork.
The training works, but the results are very similar to training without any contextual embedding (just GloVe).
Do you have any idea why, or how to fix it?
I think I might have forgotten something in my code...

Moreover, I notice that x_cemb and ques_cemb are never instantiated; they are always None. Could this be part of the issue?

Thanks in advance

Train with the BERT model

The requirement is PyTorch 0.4.0, but line 518 of Models/Bert/modeling.py is:
module._load_from_state_dict(state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs)
This code is for PyTorch 0.4.1 (huggingface/transformers#122).
When I removed "local_metadata", the error below appears. Could anyone help?

Loading train json...
Loading dev json...
Epoch 0
Using BERT Large model
Loading tokenizer from test_coqa/bert-large-uncased/bert-large-uncased-vocab.txt
04/18/2019 11:27:58 - INFO - Models.Bert.tokenization - loading vocabulary file test_coqa/bert-large-uncased/bert-large-uncased-vocab.txt


prev_ques : 2
prev_ans : 2
ques_max_len: 140


Using BERT Large model
Loading tokenizer from test_coqa/bert-large-uncased/bert-large-uncased-vocab.txt
04/18/2019 11:27:58 - INFO - Models.Bert.tokenization - loading vocabulary file test_coqa/bert-large-uncased/bert-large-uncased-vocab.txt


prev_ques : 2
prev_ans : 2
ques_max_len: 140


/da1/home/berton/project/SDNet/Models/SDNet.py:286: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
alpha_softmax = F.softmax(alpha)
Traceback (most recent call last):
  File "main.py", line 35, in <module>
    model.train()
  File "/da1/home/berton/project/SDNet/Models/SDNetTrainer.py", line 126, in train
    self.update(batch)
  File "/da1/home/berton/project/SDNet/Models/SDNetTrainer.py", line 165, in update
    query, query_mask, query_char, query_char_mask, query_bert, query_bert_mask, query_bert_offsets, len(context_words))
  File "/home/berton/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/da1/home/berton/project/SDNet/Models/SDNet.py", line 192, in forward
    x_cemb_mid = self.linear_sum(x_bert_output, self.alphaBERT, self.gammaBERT)
  File "/da1/home/berton/project/SDNet/Models/SDNet.py", line 292, in linear_sum
    res += t
RuntimeError: The expanded size of the tensor (1024) must match the existing size (388) at non-singleton dimension 2

A question about fine-tuning BERT

In your experiments, did you try fine-tuning BERT's parameters? If so, what GPU type did you use, how much GPU memory did it have, and how long did training take?

Is there a TensorFlow version of the code?

Hi, I am very interested in the encoding layer of the SDNet model, especially its novel use of BERT. Do you plan to release a TensorFlow version of the code, or at least TF code for the encoding part?

CUDA error: out of memory

While training, I am getting this error:

Traceback (most recent call last):
  File "main.py", line 33, in <module>
    model.train()
  File "/home/SDNet/Models/SDNetTrainer.py", line 126, in train
    self.update(batch)
  File "/home/SDNet/Models/SDNetTrainer.py", line 164, in update
    query, query_mask, query_char, query_char_mask, query_bert, query_bert_mask, query_bert_offsets, len(context_words))
  File "/home/SDNet/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/SDNet/Models/SDNet.py", line 253, in forward
    x_highlvl_output = self.high_lvl_context_rnn(torch.cat([x_rnn_after_inter_attn, x_self_attn_output], 2), x_mask)
  File "/home/raisudeen/SDNet/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/SDNet/Models/Layers.py", line 163, in forward
    rnn_output = self.rnns[i](rnn_input)[0]
  File "/home/SDNet/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/crm-di/raisudeen/SDNet/env/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 192, in forward
    output, hidden = func(input, self.all_weights, hx, batch_sizes)
  File "/home/SDNet/env/lib/python3.7/site-packages/torch/nn/_functions/rnn.py", line 324, in forward
    return func(input, *fargs, **fkwargs)
  File "/home/SDNet/env/lib/python3.7/site-packages/torch/nn/_functions/rnn.py", line 288, in forward
    dropout_ts)
RuntimeError: CUDA error: out of memory

I have tried changing the mini-batch size from 32 all the way down to 2 (I tried 16, 8, and 4 as well). I am using a GTX 1080 Ti GPU. Is there anything else I can do?
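Note: with BERT-Large and full-length CoQA passages, an 11 GB card may run out of memory regardless of batch size; switching to BERT-base or shortening the context are the obvious levers (an assumption, not a confirmed answer). The PyTorch 0.4-era memory counters can at least show how close the process is to the limit:

    import torch

    # GPU memory diagnostics available in the PyTorch 0.4-era API.
    print(torch.cuda.memory_allocated() / 1024 ** 3, "GiB currently allocated")
    print(torch.cuda.max_memory_allocated() / 1024 ** 3, "GiB peak allocation")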

A question about a TypeError

[screenshot of the TypeError]

We set up the file structure following the steps in the README and started training. After a while, checking the process, we found this TypeError. We would appreciate any pointers.

TF version

Hi!
I used TF 1.10.0 to convert the BERT checkpoint into pytorch_model.bin. Could you tell me which TF version you used?
When I run the source code, this error occurred:
TypeError: _load_from_state_dict() takes 7 positional arguments but 8 were given

Thank you!
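Note: this TypeError is a PyTorch version mismatch rather than a TensorFlow one. Module._load_from_state_dict gained a local_metadata parameter in PyTorch 0.4.1, and the bundled Models/Bert/modeling.py calls it with the newer signature (see the "Train with the BERT model" issue above). A minimal guard:

    import torch

    # modeling.py passes local_metadata, a parameter added in PyTorch 0.4.1; on 0.4.0
    # this raises "takes 7 positional arguments but 8 were given".
    assert torch.__version__.startswith("0.4.1"), (
        "SDNet's bundled modeling.py expects PyTorch 0.4.1, found " + torch.__version__)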

Pre-trained model?

Hi

Thanks for your amazing work on this project. Would it be possible for you to share your pre-trained model?

Thanks in advance.
