
💃DiscoBERT: Discourse-Aware Neural Extractive Text Summarization

Code repository for the ACL 2020 paper Discourse-Aware Neural Extractive Text Summarization.

Authors: Jiacheng Xu (University of Texas at Austin), Zhe Gan, Yu Cheng, and Jingjing Liu (Microsoft Dynamics 365 AI Research).

Contact: jcxu at cs dot utexas dot edu

Illustration

EDU Segmentation & Parsing

Here is an example of discourse segmentation and RST tree conversion.

Construction of Graphs: An Example

The proposed discourse-aware model selects EDUs 1-1, 2-1, 5-2, 20-1, 20-3, 22-1. The right side of the figure illustrates the two discourse graphs we use: (1) the Coref(erence) Graph (with the mentions of "Pulitzer prizes" highlighted as examples); and (2) the RST Graph (induced by RST discourse trees).

Prerequisites

The code is based on AllenNLP (v0.9) and developed with Python 3, AllenNLP, and PyTorch >= 1.0. For the full list of requirements, please check requirements.txt.
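
A quick sanity check of the environment (a minimal sketch; the version pins follow the text above):

import allennlp
import torch

# The repo targets AllenNLP 0.9 and PyTorch >= 1.0.
assert allennlp.__version__.startswith("0.9"), allennlp.__version__
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
assert (major, minor) >= (1, 0), torch.__version__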

Preprocessed Dataset & Model Archive

The preprocessed CNNDM dataset, the pre-trained CNNDM model w. discourse graph and coref graph, and the pre-trained NYT model w. discourse graph and coref graph are provided at https://utexas.box.com/v/DiscoBERT-ACL2020.

The split of NYT is provided at data_preparation/urls_nyt/mapping_{train, valid, test}.txt.
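
For example, the split files can be loaded with a trivial helper (a hypothetical sketch; it assumes one entry per line):

def load_split(split):
    # split is one of "train", "valid", "test"
    path = f"data_preparation/urls_nyt/mapping_{split}.txt"
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

train_ids = load_split("train")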

Training

The model framework (training, evaluation, etc.) is based on AllenNLP (v0.9). For the usage of most framework-related hyper-parameters, such as batch size, CUDA device, and number of samples per epoch, please refer to the AllenNLP documentation.

Here are some model-related hyper-parameters:

  • use_disco (bool): Use EDUs as the selection unit; if false, sentences are used instead.
  • trigram_block (bool): Whether to apply trigram blocking during inference.
  • min_pred_unit & max_pred_unit (int): The minimum and maximum number of units (EDUs or sentences) to select during inference. The typical range for selecting EDUs on both CNNDM and NYT is [5, 8).
  • use_disco_graph (bool): Whether to use the discourse graph for graph encoding.
  • use_coref (bool): Whether to use the coreference mention graph for graph encoding.
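
These flags end up in the model section of the AllenNLP config. As a rough illustration (a Python dict mirroring the shape of the jsonnet file; only the keys above come from this README, the values are examples):

# Illustrative values only; see configs/DiscoBERT.jsonnet for the real config.
model_config = {
    "use_disco": True,        # select EDUs rather than sentences
    "trigram_block": True,    # apply trigram blocking at inference
    "min_pred_unit": 5,       # minimum number of units to select
    "max_pred_unit": 8,       # maximum (exclusive) number of units to select
    "use_disco_graph": True,  # encode the RST discourse graph
    "use_coref": True,        # encode the coreference mention graph
}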

Comments:

  • The hyper-parameters for the BERT encoder are almost the same as the configuration from PreSumm.
  • We inflate the number of units predicted for EDU-based models because EDUs are generally shorter than sentences. For CNNDM, we found that picking 5 EDUs yields the best ROUGE F-1 score, whereas for the sentence-based model four sentences are picked.
  • We hardcoded some vector dimensions to 768 because we use the bert-base-uncased model.
  • We also tried roberta-base instead of the bert-base-uncased model used in this repo and the paper, but empirically it didn't perform better in our preliminary experiments.
  • The maximum document length is set to 768 BPEs, although we found that max_len=768 doesn't bring a significant gain over max_len=512 (see the sketch after this list for how the position embeddings are extended past 512).
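
Because bert-base-uncased ships with only 512 position embeddings, supporting 768 BPEs requires growing the position-embedding table. A minimal sketch of the idea (the attribute path follows pytorch_pretrained_bert's BertModel; zero-initializing the new rows mirrors what the repo's code does):

import torch

def extend_position_embeddings(bert_model, max_len=768):
    # Keep the 512 pre-trained rows and append freshly initialized rows
    # so that inputs longer than 512 BPEs still get position embeddings.
    emb = bert_model.embeddings.position_embeddings
    old = emb.weight.data                       # shape: (512, hidden_size)
    extra = torch.zeros(max_len - old.size(0), old.size(1))
    emb.weight = torch.nn.Parameter(torch.cat([old, extra], dim=0))
    emb.num_embeddings = max_len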

To train or modify a model, there are several files to start with.

  • model/disco_bert.py is the model file. There are some unused conditions and hyper-parameters whose names start with "semantic_red"; you can ignore them.
  • configs/DiscoBERT.jsonnet is the configuration file, which is read by the AllenNLP framework. In the pre-trained model section of https://utexas.box.com/v/DiscoBERT-ACL2020, we provide the configuration files for reference. We basically adopted most of the hyper-parameters from PreSumm. A training run can be launched as sketched below.
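
A hedged sketch of launching training programmatically with AllenNLP 0.9 (equivalent to the allennlp train CLI with --include-package model; the serialization directory is an arbitrary example):

from allennlp.common.util import import_submodules
from allennlp.commands.train import train_model_from_file

import_submodules("model")  # register the DiscoBERT model and dataset reader
train_model_from_file(
    parameter_filename="configs/DiscoBERT.jsonnet",
    serialization_dir="checkpoints/discobert",  # hypothetical output path
)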

Here is a quick reference for our model performance based on bert-base-uncased.

CNNDM:

Model                                R1/R2/RL
DiscoBERT                            43.38/20.44/40.21
DiscoBERT w. RST and Coref Graphs    43.77/20.85/40.67

NYT:

Model                                R1/R2/RL
DiscoBERT                            49.78/30.30/42.44
DiscoBERT w. RST and Coref Graphs    50.00/30.38/42.70

Citing

@inproceedings{xu-etal-2020-discourse,
    title = "Discourse-Aware Neural Extractive Text Summarization",
    author = "Xu, Jiacheng and Gan, Zhe and Cheng, Yu and Liu, Jingjing",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    year = "2020",
    publisher = "Association for Computational Linguistics"
}

Acknowledgements

  • The data preprocessing (dataset handler, oracle creation, etc.) is partially based on PreSumm by Yang Liu and Mirella Lapata.
  • Data preprocessing (tokenization, sentence splitting, coreference resolution, etc.) used CoreNLP.
  • RST discourse segmentation is generated with NeuEDUSeg. I slightly modified the code to run with GPU; please check my modification here.
  • RST discourse parsing is generated with DPLP. My customized version, featuring a batch implementation and detection of remaining files, is here. Empirically, I found that NeuEDUSeg provided better segmentation output than DPLP, so we use NeuEDUSeg for segmentation and DPLP for parsing.
  • The implementation of the graph module is based on DGL; the basic pattern is sketched below.
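
A minimal sketch of encoding a graph over EDU nodes with DGL (the edge list and feature sizes are illustrative, and this uses the current dgl.graph API rather than the repo's exact module):

import dgl
import torch
from dgl.nn.pytorch import GraphConv

src = torch.tensor([0, 1, 2])    # hypothetical RST edges between 4 EDUs
dst = torch.tensor([1, 2, 3])
g = dgl.add_self_loop(dgl.graph((src, dst), num_nodes=4))

conv = GraphConv(768, 768)       # 768 matches the bert-base hidden size
edu_feats = torch.randn(4, 768)  # placeholder EDU representations
out = conv(g, edu_feats)         # graph-encoded EDU representations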


discobert's Issues

Corrupted CNNDM model file

Hello,

It seems the CNNDM model file available for download is corrupted. Executing the following command:

tar -zxvf model.tar.gz 

results in the following error message:

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

Readme

Hi, I'm very interested in your work, but it's not easy for me to run your code successfully with the current readme. Could you please update the readme with the commands for training, evaluation, and testing, along with the parameters you used?

TypeError while running data_reader.py

Environment
allennlp == 0.9.0
python == 3.6.12
torch == 1.5.0

I ran data_reader.py and it raised the following error:

Using backend: pytorch
Traceback (most recent call last):
  File "data_reader.py", line 2, in <module>
    dataset_reader = data_reader.CNNDMDatasetReader()
  File "/home/carol/DiscoBERT/model/data_reader.py", line 114, in __init__
    self._token_indexers = token_indexers['bert'] or {"tokens": SingleIdTokenIndexer()}
TypeError: 'PretrainedBertIndexer' object is not subscriptable

I found that there isn't a key named 'bert' in token_indexers, which is why the problem occurred. I wonder if the BERT model has been correctly loaded by the following code.
token_indexers: Dict[str, TokenIndexer] = PretrainedBertIndexer("bert-base-uncased")

And here is the content of token_indexers:

token_indexers: {'_token_min_padding_length': 0, 'vocab': OrderedDict([('[PAD]', 0), ... , ('##~', 30521)]), 'wordpiece_tokenizer': <bound method WordpieceTokenizer.tokenize of <pytorch_pretrained_bert.tokenization.WordpieceTokenizer object at 0x7f008d3d84a8>>, '_namespace': 'bert', '_added_to_vocabulary': False, 'max_pieces': 512, 'use_starting_offsets': False, '_do_lowercase': True, '_truncate_long_sequences': True, '_warned_about_truncation': False, '_never_lowercase': {'[SEP]', '[PAD]', '[UNK]', '[CLS]', '[MASK]'}, '_start_piece_ids': [101], '_end_piece_ids': [102], '_separator_ids': [102]}
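
A likely fix (my assumption, not confirmed by the authors) is to pass a dict keyed by 'bert' rather than a bare indexer, since __init__ subscripts token_indexers['bert']:

from allennlp.data.token_indexers import PretrainedBertIndexer

# Wrap the indexer in a dict keyed by "bert", as __init__ expects.
token_indexers = {"bert": PretrainedBertIndexer("bert-base-uncased")}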

The maximum BERT input length in the released code doesn't match the paper

The author's code:

if bert_max_length > 512:
    first_half = self.embedder.bert_model.embeddings.position_embeddings.weight  # the first 512 positions
    # ts = torch.zeros_like(first_half, dtype=torch.float32)
    # second_half = ts.new_tensor(first_half, requires_grad=True)

    second_half = torch.zeros_like(first_half, dtype=torch.float32, requires_grad=True)

    # second_half = torch.empty(first_half.size(), dtype=torch.float32, requires_grad=True)
    # torch.nn.init.normal_(second_half, mean=0.0, std=1.0)
    out = torch.cat([first_half, second_half], dim=0)  # so the maximum position here should be 512 * 2, which doesn't match the paper
    self.embedder.bert_model.embeddings.position_embeddings.weight = torch.nn.Parameter(out)
    self.embedder.bert_model.embeddings.position_embeddings.num_embeddings = 512 * 2
    self.embedder.max_pieces = 512 * 2

The maximum length here is clearly 1024, but the paper says 768. @jiacheng-xu

How to calculate oracle scores

I tried to reproduce the Oracle (Discourse) score for CNN/DM in the paper, but it did not work.
In the paper, ROUGE-1 and ROUGE-2 are reported as 61.61 and 37.82, respectively, but when I calculated them, ROUGE-1 and ROUGE-2 were 55.10 and 32.22, respectively.
Could you please tell me more details on how to calculate this?
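
For reference, oracle extraction in PreSumm-style pipelines (which this repo's preprocessing partially builds on, per the Acknowledgements) is typically a greedy search. A rough sketch, with rouge_score as a placeholder for a ROUGE-1 + ROUGE-2 F1 scorer, not the authors' exact procedure:

def greedy_oracle(units, reference, rouge_score, max_units=5):
    # Greedily add the unit (EDU or sentence) that most improves ROUGE
    # against the reference summary; stop when nothing improves it.
    selected, best = [], 0.0
    while len(selected) < max_units:
        candidates = [
            (rouge_score([units[j] for j in selected] + [units[i]], reference), i)
            for i in range(len(units)) if i not in selected
        ]
        if not candidates:
            break
        score, i = max(candidates)
        if score <= best:
            break
        best = score
        selected.append(i)
    return sorted(selected)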

Unable to decompress the pre-trained model

I've installed allennlp in order to run this model, but the model.tar.gz downloaded from https://utexas.app.box.com/v/DiscoBERT-ACL2020/folder/110710107654 is corrupted. So when I try to run allennlp predict model/model.tar.gz CNNDM/chunk/test/, the following error occurs:

  File "/home/anaconda3/envs/lib/python3.6/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

Does anyone know how to fix this? Thank you.
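
This error usually indicates a truncated download. One illustrative way to check the archive before using it (not from the repo):

import tarfile

# Walking the member headers reads the whole gzip stream, so a truncated
# download raises before we ever try to extract anything.
try:
    with tarfile.open("model.tar.gz", "r:gz") as tar:
        for member in tar:
            pass
    print("archive looks complete")
except (EOFError, tarfile.ReadError):
    print("archive is truncated; re-download it")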

Test output

Hi, do you have the summary output of the test set?

Thanks
