
💃DiscoBERT: Discourse-Aware Neural Extractive Text Summarization

Code repository for the ACL 2020 paper Discourse-Aware Neural Extractive Text Summarization.

Authors: Jiacheng Xu (University of Texas at Austin), Zhe Gan, Yu Cheng, and Jingjing Liu (Microsoft Dynamics 365 AI Research).

Contact: jcxu at cs dot utexas dot edu

Illustration

EDU Segmentation & Parsing

Here is an example of discourse segmentation and RST tree conversion.

Construction of Graphs: An Example

The proposed discourse-aware model selects EDUs 1-1, 2-1, 5-2, 20-1, 20-3, 22-1. The right side of the figure illustrates the two discourse graphs we use: (1) the Coref(erence) Graph (with the mentions of "Pulitzer prizes" highlighted as examples); and (2) the RST Graph (induced by RST discourse trees).

Prerequisites

The code is based on AllenNLP (v0.9) and developed with Python 3, AllenNLP, and PyTorch >= 1.0. For the full list of requirements, please check requirements.txt.
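
A quick sanity check of the environment (a minimal sketch; the version pins follow the text above):

import allennlp
import torch

# The repo targets AllenNLP 0.9 and PyTorch >= 1.0.
assert allennlp.__version__.startswith("0.9"), allennlp.__version__
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
assert (major, minor) >= (1, 0), torch.__version__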

Preprocessed Dataset & Model Archive

The preprocessed CNNDM dataset, the pre-trained CNNDM model w. discourse graph and coref graph, and the pre-trained NYT model w. discourse graph and coref graph are provided at https://utexas.box.com/v/DiscoBERT-ACL2020.

The split of NYT is provided at data_preparation/urls_nyt/mapping_{train, valid, test}.txt.
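
For example, the split files can be loaded with a trivial helper (a hypothetical sketch; it assumes one entry per line):

def load_split(split):
    # split is one of "train", "valid", "test"
    path = f"data_preparation/urls_nyt/mapping_{split}.txt"
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

train_ids = load_split("train")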

Training

The model framework (training, evaluation, etc.) is based on AllenNLP (v0.9). For the usage of most framework-related hyper-parameters, such as batch size, CUDA device, and number of samples per epoch, please refer to the AllenNLP documentation.

Here are some model-related hyper-parameters:

  • use_disco (bool): Use EDUs as the selection unit; if false, sentences are used instead.
  • trigram_block (bool): Whether to apply trigram blocking during inference.
  • min_pred_unit & max_pred_unit (int): The minimum and maximum number of units (EDUs or sentences) to select during inference. The typical range for selecting EDUs on both CNNDM and NYT is [5, 8).
  • use_disco_graph (bool): Whether to use the discourse graph for graph encoding.
  • use_coref (bool): Whether to use the coreference mention graph for graph encoding.
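
These flags end up in the model section of the AllenNLP config. As a rough illustration (a Python dict mirroring the shape of the jsonnet file; only the keys above come from this README, the values are examples):

# Illustrative values only; see configs/DiscoBERT.jsonnet for the real config.
model_config = {
    "use_disco": True,        # select EDUs rather than sentences
    "trigram_block": True,    # apply trigram blocking at inference
    "min_pred_unit": 5,       # minimum number of units to select
    "max_pred_unit": 8,       # maximum (exclusive) number of units to select
    "use_disco_graph": True,  # encode the RST discourse graph
    "use_coref": True,        # encode the coreference mention graph
}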

Comments:

  • The hyper-parameters for the BERT encoder are almost the same as the configuration from PreSumm.
  • We inflate the number of units predicted for EDU-based models because EDUs are generally shorter than sentences. For CNNDM, we found that picking 5 EDUs yields the best ROUGE F-1 score, whereas for the sentence-based model four sentences are picked.
  • We hardcoded some vector dimensions to 768 because we use the bert-base-uncased model.
  • We also tried roberta-base instead of the bert-base-uncased model used in this repo and the paper, but empirically it didn't perform better in our preliminary experiments.
  • The maximum document length is set to 768 BPEs, although we found that max_len=768 doesn't bring a significant gain over max_len=512 (see the sketch after this list for how the position embeddings are extended past 512).
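
Because bert-base-uncased ships with only 512 position embeddings, supporting 768 BPEs requires growing the position-embedding table. A minimal sketch of the idea (the attribute path follows pytorch_pretrained_bert's BertModel; zero-initializing the new rows mirrors what the repo's code does):

import torch

def extend_position_embeddings(bert_model, max_len=768):
    # Keep the 512 pre-trained rows and append freshly initialized rows
    # so that inputs longer than 512 BPEs still get position embeddings.
    emb = bert_model.embeddings.position_embeddings
    old = emb.weight.data                       # shape: (512, hidden_size)
    extra = torch.zeros(max_len - old.size(0), old.size(1))
    emb.weight = torch.nn.Parameter(torch.cat([old, extra], dim=0))
    emb.num_embeddings = max_len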

To train or modify a model, there are several files to start with.

  • model/disco_bert.py is the model file. There are some unused conditions and hyper-parameters whose names start with "semantic_red"; you can ignore them.
  • configs/DiscoBERT.jsonnet is the configuration file, which is read by the AllenNLP framework. In the pre-trained model section of https://utexas.box.com/v/DiscoBERT-ACL2020, we provide the configuration files for reference. We basically adopted most of the hyper-parameters from PreSumm. A training run can be launched as sketched below.
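
A hedged sketch of launching training programmatically with AllenNLP 0.9 (equivalent to the allennlp train CLI with --include-package model; the serialization directory is an arbitrary example):

from allennlp.common.util import import_submodules
from allennlp.commands.train import train_model_from_file

import_submodules("model")  # register the DiscoBERT model and dataset reader
train_model_from_file(
    parameter_filename="configs/DiscoBERT.jsonnet",
    serialization_dir="checkpoints/discobert",  # hypothetical output path
)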

Here is a quick reference for our model performance based on bert-base-uncased.

CNNDM:

Model                                R1/R2/RL
DiscoBERT                            43.38/20.44/40.21
DiscoBERT w. RST and Coref Graphs    43.77/20.85/40.67

NYT:

Model                                R1/R2/RL
DiscoBERT                            49.78/30.30/42.44
DiscoBERT w. RST and Coref Graphs    50.00/30.38/42.70

Citing

@inproceedings{xu-etal-2020-discourse,
    title = "Discourse-Aware Neural Extractive Text Summarization",
    author = "Xu, Jiacheng and Gan, Zhe and Cheng, Yu and Liu, Jingjing",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    year = "2020",
    publisher = "Association for Computational Linguistics"
}

Acknowledgements

  • The data preprocessing (dataset handler, oracle creation, etc.) is partially based on PreSumm by Yang Liu and Mirella Lapata.
  • Data preprocessing (tokenization, sentence splitting, coreference resolution, etc.) used CoreNLP.
  • RST discourse segmentation is generated with NeuEDUSeg. I slightly modified the code to run with GPU; please check my modification here.
  • RST discourse parsing is generated with DPLP. My customized version, featuring a batch implementation and detection of remaining files, is here. Empirically, I found that NeuEDUSeg provided better segmentation output than DPLP, so we use NeuEDUSeg for segmentation and DPLP for parsing.
  • The implementation of the graph module is based on DGL; the basic pattern is sketched below.
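
A minimal sketch of encoding a graph over EDU nodes with DGL (the edge list and feature sizes are illustrative, and this uses the current dgl.graph API rather than the repo's exact module):

import dgl
import torch
from dgl.nn.pytorch import GraphConv

src = torch.tensor([0, 1, 2])    # hypothetical RST edges between 4 EDUs
dst = torch.tensor([1, 2, 3])
g = dgl.add_self_loop(dgl.graph((src, dst), num_nodes=4))

conv = GraphConv(768, 768)       # 768 matches the bert-base hidden size
edu_feats = torch.randn(4, 768)  # placeholder EDU representations
out = conv(g, edu_feats)         # graph-encoded EDU representations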


discobert's Issues

Corrupted CNNDM model file

Hello,

It seems the CNNDM model file available for download is corrupted. Executing the following command:

tar -zxvf model.tar.gz 

results in the following error message:

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

Readme

Hi, I'm very interested in your work, but it's not easy for me to run your code successfully with the current readme. Could you please update the readme with the commands for training, evaluation, and testing, along with the parameters you used?

TypeError while running data_reader.py

Environment
allennlp == 0.9.0
python == 3.6.12
torch == 1.5.0

I ran data_reader.py and it raised the following error:

Using backend: pytorch
Traceback (most recent call last):
  File "data_reader.py", line 2, in <module>
    dataset_reader = data_reader.CNNDMDatasetReader()
  File "/home/carol/DiscoBERT/model/data_reader.py", line 114, in __init__
    self._token_indexers = token_indexers['bert'] or {"tokens": SingleIdTokenIndexer()}
TypeError: 'PretrainedBertIndexer' object is not subscriptable

I found that there isn't a key named 'bert' in token_indexers, which is why the problem occurred. I wonder if the BERT model has been correctly loaded by the following code.
token_indexers: Dict[str, TokenIndexer] = PretrainedBertIndexer("bert-base-uncased")

And here is the content of token_indexers:

token_indexers: {'_token_min_padding_length': 0, 'vocab': OrderedDict([('[PAD]', 0), ... , ('##~', 30521)]), 'wordpiece_tokenizer': <bound method WordpieceTokenizer.tokenize of <pytorch_pretrained_bert.tokenization.WordpieceTokenizer object at 0x7f008d3d84a8>>, '_namespace': 'bert', '_added_to_vocabulary': False, 'max_pieces': 512, 'use_starting_offsets': False, '_do_lowercase': True, '_truncate_long_sequences': True, '_warned_about_truncation': False, '_never_lowercase': {'[SEP]', '[PAD]', '[UNK]', '[CLS]', '[MASK]'}, '_start_piece_ids': [101], '_end_piece_ids': [102], '_separator_ids': [102]}
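
A likely fix (my assumption, not confirmed by the authors) is to pass a dict keyed by 'bert' rather than a bare indexer, since __init__ subscripts token_indexers['bert']:

from allennlp.data.token_indexers import PretrainedBertIndexer

# Wrap the indexer in a dict keyed by "bert", as __init__ expects.
token_indexers = {"bert": PretrainedBertIndexer("bert-base-uncased")}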

The maximum BERT input length in the released code doesn't match the paper

The author's code:

if bert_max_length > 512:
    first_half = self.embedder.bert_model.embeddings.position_embeddings.weight  # the first 512 positions
    # ts = torch.zeros_like(first_half, dtype=torch.float32)
    # second_half = ts.new_tensor(first_half, requires_grad=True)

    second_half = torch.zeros_like(first_half, dtype=torch.float32, requires_grad=True)

    # second_half = torch.empty(first_half.size(), dtype=torch.float32, requires_grad=True)
    # torch.nn.init.normal_(second_half, mean=0.0, std=1.0)
    out = torch.cat([first_half, second_half], dim=0)  # so the maximum position here should be 512 * 2, which doesn't match the paper
    self.embedder.bert_model.embeddings.position_embeddings.weight = torch.nn.Parameter(out)
    self.embedder.bert_model.embeddings.position_embeddings.num_embeddings = 512 * 2
    self.embedder.max_pieces = 512 * 2

The maximum length here is clearly 1024, but the paper says 768. @jiacheng-xu

How to calculate oracle scores

I tried to reproduce the Oracle (Discourse) score for CNN/DM in the paper, but it did not work.
In the paper, ROUGE-1 and ROUGE-2 are reported as 61.61 and 37.82, respectively, but when I calculated them, ROUGE-1 and ROUGE-2 were 55.10 and 32.22, respectively.
Could you please tell me more details on how to calculate this?
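
For reference, oracle extraction in PreSumm-style pipelines (which this repo's preprocessing partially builds on, per the Acknowledgements) is typically a greedy search. A rough sketch, with rouge_score as a placeholder for a ROUGE-1 + ROUGE-2 F1 scorer, not the authors' exact procedure:

def greedy_oracle(units, reference, rouge_score, max_units=5):
    # Greedily add the unit (EDU or sentence) that most improves ROUGE
    # against the reference summary; stop when nothing improves it.
    selected, best = [], 0.0
    while len(selected) < max_units:
        candidates = [
            (rouge_score([units[j] for j in selected] + [units[i]], reference), i)
            for i in range(len(units)) if i not in selected
        ]
        if not candidates:
            break
        score, i = max(candidates)
        if score <= best:
            break
        best = score
        selected.append(i)
    return sorted(selected)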

Unable to decompress the pre-trained model

I've installed allennlp in order to run this model, but the model.tar.gz downloaded from https://utexas.app.box.com/v/DiscoBERT-ACL2020/folder/110710107654 is corrupted. So when I try to run allennlp predict model/model.tar.gz CNNDM/chunk/test/, the following error occurs:

  File "/home/anaconda3/envs/lib/python3.6/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

Does anyone know how to fix this? Thank you.
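
This error usually indicates a truncated download. One illustrative way to check the archive before using it (not from the repo):

import tarfile

# Walking the member headers reads the whole gzip stream, so a truncated
# download raises before we ever try to extract anything.
try:
    with tarfile.open("model.tar.gz", "r:gz") as tar:
        for member in tar:
            pass
    print("archive looks complete")
except (EOFError, tarfile.ReadError):
    print("archive is truncated; re-download it")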

Test output

Hi, do you have the summary output of the test set?

Thanks
