bottom-up-summary's Introduction

Bottom-Up Summarization

This repository describes how to add Bottom-Up Attention to your abstractive summarization model. If you are interested in downloading predictions, models, or other files, please look at the bottom of the page.

The article will appear in the proceedings of EMNLP 2018. A preprint is available here: https://arxiv.org/pdf/1808.10792.pdf

If you cite this work, please use the following bibtex:

@article{gehrmann2018bottom,
  title={Bottom-Up Abstractive Summarization},
  author={Gehrmann, Sebastian and Deng, Yuntian and Rush, Alexander M},
  journal={arXiv preprint arXiv:1808.10792},
  year={2018}
}

Overview of the whole process

Image showing the process

Individual steps

(a) Train abstractive model on full data

Please follow the instructions here to train the Pointer-Generator model with Coverage Penalty: http://opennmt.net/OpenNMT-py/Summarization.html

Results without Content Selector

CNNDM: R1 39.02, R2 17.25, RL 36.05

Gigaword (Results without penalty): R1 35.51, R2 17.35, RL 33.17

NYT: R1 45.13, R2 30.13, RL 39.67

(b) Create content-selection dataset

AllenNLP requires a specific format for the training data. We provide a script that processes a dataset of line-separated examples stored in two parallel files, src.txt and tgt.txt.

Commands

Step 1 - shuffle the data:

mkfifo onerandom tworandom
tee onerandom tworandom < /dev/urandom > /dev/null &
shuf --random-source=onerandom ./src.txt > ./src.txt.shuf &
shuf --random-source=tworandom ./tgt.txt > ./tgt.txt.shuf &
wait
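
The shell commands above feed the same random stream to both shuf calls so that src.txt and tgt.txt are permuted identically. For readers without GNU shuf, the same idea can be sketched in Python. This is a minimal illustration, not part of the repository; the function name `shuffle_parallel` and the `seed` parameter are assumptions made for this example.

```python
import random
from typing import List, Tuple


def shuffle_parallel(src_lines: List[str], tgt_lines: List[str],
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle two parallel lists with one shared permutation.

    Hypothetical helper: it keeps example i of src aligned with
    example i of tgt, which is what the shared random source achieves
    in the shell version.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("src and tgt must have the same number of lines")
    order = list(range(len(src_lines)))
    random.Random(seed).shuffle(order)  # one permutation for both files
    return [src_lines[i] for i in order], [tgt_lines[i] for i in order]


if __name__ == "__main__":
    src = ["article 0", "article 1", "article 2"]
    tgt = ["summary 0", "summary 1", "summary 2"]
    for s, t in zip(*shuffle_parallel(src, tgt, seed=13)):
        print(s, "|", t)
```

Reading both files into memory is fine for datasets of this size; for very large files a streaming approach would be needed instead.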

Step 2 - create data formatted for allennlp

# -prune: max number of words in a document
# -num_examples: 100k should be enough for convergence
python preprocess_copy.py -src $srcpath \
                          -tgt $tgtpath \
                          -output data/processed/multicopy.XXX \
                          -prune 400 \
                          -num_examples 100000

Preprocessing code can be found in Extractive Preprocessing.ipynb.
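
To make the tagging idea concrete, here is a rough sketch of how source tokens could be labeled for content selection. It follows the rule quoted from the paper in the issues below (a token is tagged if it is part of the longest possible subsequence of tokens that also appears in the summary, and only the first occurrence of a span counts). It is an illustration under those assumptions and not the code in preprocess_copy.py; the names `contains_span` and `tag_copied_tokens` are made up for this example.

```python
from typing import List, Sequence


def contains_span(haystack: Sequence[str], needle: Sequence[str]) -> bool:
    """Return True if `needle` occurs as a contiguous token span in `haystack`."""
    n = len(needle)
    return any(list(haystack[k:k + n]) == list(needle)
               for k in range(len(haystack) - n + 1))


def tag_copied_tokens(src: List[str], tgt: List[str]) -> List[int]:
    """Greedy sketch: tag source tokens that belong to the longest span
    starting at the current position that also occurs in the target.
    A span that was already tagged earlier in the source is skipped."""
    tags = [0] * len(src)
    seen = set()
    i = 0
    while i < len(src):
        n = 0
        # Extend the span while it still occurs in the target.
        while i + n < len(src) and contains_span(tgt, src[i:i + n + 1]):
            n += 1
        if n == 0:
            i += 1
            continue
        span = tuple(src[i:i + n])
        if span not in seen:  # only the first occurrence is tagged
            seen.add(span)
            tags[i:i + n] = [1] * n
        i += n
    return tags


if __name__ == "__main__":
    source = "a b c d a b".split()
    summary = "a b x d".split()
    print(tag_copied_tokens(source, summary))
```

Because the matching is greedy and works on single tokens as well as phrases, frequent function words can be tagged even when they are not meaningfully "copied"; a stricter variant could require spans of at least two tokens.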

(c) Train AllenNLP tagging model

Commands

Model configuration files are in the folder allennlp_config. Modify the lines about file locations and cuda device before running an experiment.

To train a model, run the command

python -m allennlp.run train \
                       allennlp_config/$filename.json \
                       --serialization-dir $output_folder

Make sure to use a different $output_folder for each experiment to prevent accidentally overwriting and reusing models.

There are multiple different configurations in the folder:

  • tagger_simple: tagging model with convolutional character encodings and bidirectional LSTM
  • tagger_elmo: tagging model with ELMo + standard word encodings and bidirectional LSTM
  • tagger_CRF: uses a CRF on top of the model to calculate transitions between states

(d) Run the Content-Selector

Commands

During preprocessing, we create a file named *.src.txt. This one can be used to run inference with the trained model.

python -m allennlp.run predict \
                       $modelfile \
                       $datafile \
                       --output $outputfile \
                       --cuda-device 0 \
                       --batch-size 50

(e) Use Content-Selector as Extractive Summarizer

One option is to use the trained Content-Selector directly as an extractive model. The script prediction_to_text.py takes care of this.

The script can also evaluate against the gold targets created during preprocessing if you set tgt. You can switch between extracting sentences and extracting phrases with the style parameter. If you want additional indicators between extracted phrases, use divider. The threshold for extracting phrases is set with threshold. Finally, the prune option clips the number of words in an input (for best results, use the same number of words as in preprocessing).

Commands

To run, call

# -tgt is optional; when set, the script prints F1, AUC etc.
# -style is one of: sentences, phrases, threesent
python prediction_to_text.py -data $predictionfile \
                             -output $outfname \
                             -tgt $tgtfile \
                             -threshold 0.25 \
                             -divider "" \
                             -style phrases \
                             -prune 400
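
To illustrate how threshold, divider and prune interact in phrase extraction, here is a small sketch. It assumes the Content-Selector yields one selection probability per source token; the function `extract_phrases` and its exact behavior are assumptions made for illustration and may differ from what prediction_to_text.py does.

```python
from typing import List


def extract_phrases(tokens: List[str], probs: List[float],
                    threshold: float = 0.25, divider: str = "",
                    prune: int = 400) -> str:
    """Keep tokens whose probability exceeds `threshold`, group adjacent
    kept tokens into phrases, and join phrases with `divider`."""
    tokens, probs = tokens[:prune], probs[:prune]  # clip long inputs
    phrases, current = [], []
    for token, prob in zip(tokens, probs):
        if prob > threshold:
            current.append(token)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    separator = f" {divider} " if divider else " "
    return separator.join(phrases)


if __name__ == "__main__":
    toks = "the cat sat on the mat".split()
    scores = [0.1, 0.9, 0.8, 0.05, 0.2, 0.6]
    print(extract_phrases(toks, scores, threshold=0.25, divider="|"))
```

With the values above, "cat sat" and "mat" are the two extracted phrases; raising the threshold shortens the extract, lowering it lengthens it.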

Results

CNNDM with 3 sentences: R1 40.7, R2 18.0, RL 37.0

CNNDM with phrases: R1 42.0, R2 15.9, RL 37.3

(f) Use probabilities in Bottom-Up Attention

You can find outputs of our model here: https://drive.google.com/file/d/1EqiEVt3H7z7oCQBKkCO7MXoJkXM7Cipr/view?usp=sharing

We are currently working on documenting the code to combine the allennlp output and the OpenNMT model. If you want to run the inference, please download this branch: https://github.com/sebastianGehrmann/OpenNMT-py/tree/copy_constraint

You will need to run this command:

# -model: you can use a model downloaded from the link above
# -src: see download link below
# -constraint_file: follow the instructions or see download link below
# -threshold: we found numbers between 0.1 and 0.2 to work; our reported numbers use 0.15
# -batch_size: currently non-batched, sorry!
python translate.py -model $model_PATH \
                    -src $CNNDM_test_input_PATH \
                    -constraint_file $allennlp_output_PATH \
                    -threshold $BOTTOM_UP_THRESHOLD \
                    -batch_size 1 \
                    -min_length 35 \
                    -stepwise_penalty \
                    -coverage_penalty summary \
                    -beta 5 \
                    -length_penalty wu \
                    -alpha 0.9 \
                    -block_ngram_repeat 3 \
                    -ignore_when_blocking "." "</t>" "<t>" \
                    -output $prediction_PATH \
                    -gpu $gpuid
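
For intuition about what the threshold controls: as described in the paper, bottom-up attention restricts the copy attention to source words that the Content-Selector scored above the threshold and then renormalizes the remaining attention. The sketch below shows that idea on plain Python lists. It is not the code in the copy_constraint branch; the function name `mask_copy_attention` and the fallback for the case where every word is masked are assumptions.

```python
from typing import List


def mask_copy_attention(attn: List[float], select_probs: List[float],
                        threshold: float = 0.15) -> List[float]:
    """Zero the copy attention of source words whose selection
    probability is below `threshold`, then renormalize to sum to 1."""
    masked = [a if q >= threshold else 0.0
              for a, q in zip(attn, select_probs)]
    total = sum(masked)
    if total == 0.0:
        # Assumption: if everything was masked, keep the original
        # attention instead of dividing by zero.
        return list(attn)
    return [m / total for m in masked]


if __name__ == "__main__":
    attention = [0.5, 0.3, 0.2]
    selection = [0.9, 0.1, 0.4]
    print(mask_copy_attention(attention, selection, threshold=0.15))
```

In this example the second source word falls below the threshold, so its attention mass is redistributed proportionally over the other two words.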

Downloadable Content

Note that our predictions have sentence tags <t> and </t> which need to be removed for ROUGE scoring. To reproduce our numbers, please follow the evaluation instructions here.

  1. Model: https://s3.amazonaws.com/opennmt-models/Summary/ada6_bridge_oldcopy_tagged_larger_acc_54.84_ppl_10.58_e17.pt
  2. allennlp input: https://drive.google.com/file/d/1TNGGBX7iAgvkfyFsDzPKvlbWd4vmCVTR/view?usp=sharing
  3. allennlp output: https://drive.google.com/file/d/1IBByzlLwj_JKy-V_mB7563HtRYZilsOl/view?usp=sharing
  4. Bottom-Up Attention input: https://drive.google.com/file/d/1k-LqK3Lt7czIKyVrH_tr3P3Qd_39gLhk/view?usp=sharing
  5. Bottom-Up Attention output: https://drive.google.com/file/d/1EqiEVt3H7z7oCQBKkCO7MXoJkXM7Cipr/view?usp=sharing
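
The sentence tags mentioned in the note above can be stripped with a few lines of Python before scoring. This is a minimal sketch; the helper names are made up for this example, and the second helper is only useful if your ROUGE tooling expects one sentence per line.

```python
import re
from typing import List


def strip_sentence_tags(line: str) -> str:
    """Remove <t> and </t> and collapse the leftover whitespace."""
    without_tags = re.sub(r"</?t>", " ", line)
    return re.sub(r"\s+", " ", without_tags).strip()


def split_tagged_sentences(line: str) -> List[str]:
    """Return the sentences enclosed in <t> ... </t>, one per list item."""
    sentences = re.findall(r"<t>(.*?)</t>", line)
    return [" ".join(s.split()) for s in sentences if s.strip()]


if __name__ == "__main__":
    prediction = "<t> a b . </t> <t> c d . </t>"
    print(strip_sentence_tags(prediction))
    print(split_tagged_sentences(prediction))
```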

bottom-up-summary's People

Contributors

amulder, bckim92, sebastiangehrmann, shujian2015

bottom-up-summary's Issues

Would you like to provide the original OpenNMT you use?

Thank you for your work. I learned from your paper that you use the original OpenNMT for training. Is that right? Could you tell me which version of OpenNMT you used, or how you obtained the model listed under "Downloadable Content"?

src & tgt

Hi, @sebastianGehrmann
Your work looks really fascinating and it's what I'm looking for. However, I'm a bit confused while trying to reproduce the results. Should tgt refer to the summary and src refer to the article content?
Thanks in advance.

multicopy.XXX

Where can I find the file data/processed/multicopy.XXX?
And what does it mean?

TypeError: __init__() got an unexpected keyword argument 'tensor_type'

First of all, thanks for uploading pretrained model and prediction outputs. It's really helpful.

To produce prediction output with the pretrained model, I am using the following command, but I get the error shown below. There seems to be a model-loading issue. Could you please help with this? Also, did I provide the correct parameters?

Thanks.

Command:

python translate.py -model ./bottomUpModel/ada6_bridge_oldcopy_tagged_larger_acc_54.84_ppl_10.58_e17.pt \
                    -src ./bottomUpModel/test.txt.src \
                    -constraint_file ./bottomUpModel/allennlp_test_output.txt \
                    -threshold 0.15 \
                    -batch_size 1 \
                    -min_length 35 \
                    -stepwise_penalty \
                    -coverage_penalty summary \
                    -beta 5 \
                    -length_penalty wu \
                    -alpha 0.9 \
                    -block_ngram_repeat 3 \
                    -ignore_when_blocking "." "</t>" "<t>" \
                    -output ./bottomUpModel/prediction \
                    -gpu 0

Error:

Traceback (most recent call last):
  File "translate.py", line 162, in <module>
    main()
  File "translate.py", line 67, in main
    onmt.ModelConstructor.load_test_model(opt, dummy_opt.__dict__)
  File "./bottomUpModel/OpenNMT-py-copy_constraint/onmt/ModelConstructor.py", line 121, in load_test_model
    checkpoint['vocab'], data_type=opt.data_type)
  File "./bottomUpModel/OpenNMT-py-copy_constraint/onmt/io/IO.py", line 57, in load_fields_from_vocab
    fields = get_fields(data_type, n_src_features, n_tgt_features)
  File "./bottomUpModel/OpenNMT-py-copy_constraint/onmt/io/IO.py", line 43, in get_fields
    return TextDataset.get_fields(n_src_features, n_tgt_features)
  File "./bottomUpModel/OpenNMT-py-copy_constraint/onmt/io/TextDataset.py", line 231, in get_fields
    postprocessing=make_src, sequential=False)
TypeError: __init__() got an unexpected keyword argument 'tensor_type'

Extremely Low ROUGE-L

Hi, I evaluated the given output with the pyrouge package after first removing the <t> and </t> tags.
The complete result is:

ROUGE-1:
rouge_1_f_score: 0.4153 with confidence interval (0.4131, 0.4176)
rouge_1_recall: 0.4261 with confidence interval (0.4234, 0.4289)
rouge_1_precision: 0.4317 with confidence interval (0.4293, 0.4343)

ROUGE-2:
rouge_2_f_score: 0.1877 with confidence interval (0.1855, 0.1899)
rouge_2_recall: 0.1929 with confidence interval (0.1905, 0.1952)
rouge_2_precision: 0.1953 with confidence interval (0.1929, 0.1977)

ROUGE-l:
rouge_l_f_score: 0.2792 with confidence interval (0.2772, 0.2813)
rouge_l_recall: 0.2872 with confidence interval (0.2848, 0.2896)
rouge_l_precision: 0.2897 with confidence interval (0.2875, 0.2920)

R-1 and R-2 seem similar to the paper, but why is ROUGE-L so low?

Extractive Preprocessing.ipynb is missing

Hi,

I cannot find Extractive Preprocessing.ipynb, which is mentioned in step (b) of README.md. I think the file is missing. Can you upload that notebook?

Thanks!

Required Format for step-a inputs

The CNN-DM dataset contains stories in which the complete article is followed by its summary (highlights). To train on this data in OpenNMT, the preprocessing command is:

python preprocess.py -train_src data/cnndm/train.txt.src \
                     -train_tgt data/cnndm/train.txt.tgt \

What is the required src-tgt format for the train.txt.src and train.txt.tgt files? Should I put the whole article on one line and the whole summary on one line in the respective source and target files?

  • Could you please also share an ETA for the complete documentation?
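
The README describes the inputs as line-separated examples, so one plausible layout is one article per line in the source file and the matching summary on the same line number in the target file. The sketch below shows that layout; it is an assumption based on that description, not a confirmed specification, and the helper names are hypothetical.

```python
from typing import Iterable, List, Tuple


def to_single_line(text: str) -> str:
    """Collapse newlines and repeated whitespace so one example is one line."""
    return " ".join(text.split())


def to_parallel_lines(
    pairs: Iterable[Tuple[str, str]]
) -> Tuple[List[str], List[str]]:
    """Turn (article, summary) pairs into two aligned lists of lines:
    line i of the source belongs to line i of the target."""
    src_lines, tgt_lines = [], []
    for article, summary in pairs:
        src_lines.append(to_single_line(article))
        tgt_lines.append(to_single_line(summary))
    return src_lines, tgt_lines


if __name__ == "__main__":
    data = [("first article .\nsecond paragraph .", "short summary ."),
            ("another article .", "another\nsummary .")]
    src, tgt = to_parallel_lines(data)
    print("\n".join(src))
    print("\n".join(tgt))
```

Writing `"\n".join(src)` and `"\n".join(tgt)` to two files then yields parallel files with the same number of lines.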

Is the elmo_tagger.json complete?

Hi,
I got the following error when using the elmo_tagger configuration.

Traceback (most recent call last):
  File "/home/server3/.conda/envs/pyth_3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/server3/.conda/envs/pyth_3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/server3/.conda/envs/pyth_3/lib/python3.6/site-packages/allennlp/run.py", line 18, in <module>
    main(prog="allennlp")
  File "/home/server3/.conda/envs/pyth_3/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 72, in main
    args.func(args)
  File "/home/server3/.conda/envs/pyth_3/lib/python3.6/site-packages/allennlp/commands/train.py", line 111, in train_model_from_args
    args.force)
  File "/home/server3/.conda/envs/pyth_3/lib/python3.6/site-packages/allennlp/commands/train.py", line 142, in train_model_from_file
    return train_model(params, serialization_dir, file_friendly_logging, recover, force)
  File "/home/server3/.conda/envs/pyth_3/lib/python3.6/site-packages/allennlp/commands/train.py", line 298, in train_model
    model = Model.from_params(vocab=vocab, params=params.pop('model'))
  File "/home/server3/.conda/envs/pyth_3/lib/python3.6/site-packages/allennlp/common/from_params.py", line 274, in from_params
    return subclass.from_params(params=params, **extras)
  File "/home/server3/.conda/envs/pyth_3/lib/python3.6/site-packages/allennlp/common/from_params.py", line 285, in from_params
    kwargs = create_kwargs(cls, params, **extras)
  File "/home/server3/.conda/envs/pyth_3/lib/python3.6/site-packages/allennlp/common/from_params.py", line 150, in create_kwargs
    raise ConfigurationError(f"expected key {name} for {cls.__name__}")
allennlp.common.checks.ConfigurationError: 'expected key encoder for SimpleTagger'
[INFO/MainProcess] process shutting down

How can I fix it?

json.decoder.JSONDecodeError: Extra data problem in allennlp prediction

Hey,

I'd like to train the content-selector on my own dataset. After training with allennlp using the specified options, I tried to run prediction on a test set that I converted to the new format with preprocess_copy.py. Then I ran the prediction command:

CUDA_VISIBLE_DEVICES=0,1 python -m allennlp.run predict models/model.tar.gz data/multicopy.test.src.txt --output out/prediction.txt --cuda-device 1 --batch-size 10

Unfortunately, the run ends with this error message:

Traceback (most recent call last):
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/run.py", line 21, in <module>
    run()
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 101, in main
    args.func(args)
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/commands/predict.py", line 200, in _predict
    manager.run()
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/commands/predict.py", line 179, in run
    for batch_json in lazy_groups_of(self._get_json_data(), self._batch_size):
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/common/util.py", line 105, in <lambda>
    return iter(lambda: list(islice(iterator, 0, group_size)), [])
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/commands/predict.py", line 162, in _get_json_data
    yield self._predictor.load_line(line)
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/site-packages/allennlp/predictors/predictor.py", line 45, in load_line
    return json.loads(line)
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/sajad/anaconda3/envs/allennlp/lib/python3.6/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 3 (char 2)
2019-03-30 15:44:53,390 - INFO - allennlp.models.archival - removing temporary unarchived model dir at /tmp/tmpmdt8z2s3

Any idea to fix this?
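
The traceback ends in json.loads(line), which suggests that the predictor reads the input file line by line and expects each line to be a single JSON value. A quick way to locate offending lines is sketched below. The helper name `find_bad_json_lines` is made up for this example, and the "sentence" key in the demo input is only an assumption about what the tagger's predictor expects.

```python
import json
from typing import Iterable, List, Tuple


def find_bad_json_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, reason) for every non-empty line that is
    not a single JSON object."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError as err:
            bad.append((lineno, err.msg))
            continue
        if not isinstance(value, dict):
            bad.append((lineno, "not a JSON object"))
    return bad


if __name__ == "__main__":
    sample = ['{"sentence": "a b c"}',  # hypothetical well-formed line
              "2 plain text",            # raw text: raises "Extra data"
              '"just a string"']         # valid JSON, but not an object
    for lineno, reason in find_bad_json_lines(sample):
        print(f"line {lineno}: {reason}")
```

To check a real file, pass an open file handle, for example `find_bad_json_lines(open("data/multicopy.test.src.txt"))`.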

RuntimeError: The size of tensor a (400) must match the size of tensor b (0) at non-singleton dimension 1

When I use the following script to train the transformer model on CNN-DM (http://opennmt.net/OpenNMT-py/Summarization.html):

onmt_train -data data/cnndm/CNNDM \
           -save_model models/cnndm \
           -layers 4 \
           -rnn_size 512 \
           -word_vec_size 512 \
           -max_grad_norm 0 \
           -optim adam \
           -encoder_type transformer \
           -decoder_type transformer \
           -position_encoding \
           -dropout 0.2 \
           -param_init 0 \
           -warmup_steps 8000 \
           -learning_rate 2 \
           -decay_method noam \
           -label_smoothing 0.1 \
           -adam_beta2 0.998 \
           -batch_size 4096 \
           -batch_type tokens \
           -normalization tokens \
           -max_generator_batches 2 \
           -train_steps 200000 \
           -accum_count 4 \
           -share_embeddings \
           -copy_attn \
           -param_init_glorot \
           -world_size 2 \
           -gpu_ranks 0 1

I get the following error:
Traceback (most recent call last):
  File "train.py", line 438, in <module>
    main()
  File "train.py", line 430, in main
    train_model(model, fields, optim, data_type, model_opt)
  File "train.py", line 252, in train_model
    train_stats = trainer.train(train_iter, epoch, report_func)
  File "/home/cai/yym/ddl/final/OpenNMT-py-copy_constraint/onmt/Trainer.py", line 178, in train
    report_stats, normalization)
  File "/home/cai/yym/ddl/final/OpenNMT-py-copy_constraint/onmt/Trainer.py", line 311, in _gradient_accumulation
    trunc_size, self.shard_size, normalization)
  File "/home/cai/yym/ddl/final/OpenNMT-py-copy_constraint/onmt/Loss.py", line 123, in sharded_compute_loss
    loss, stats = self._compute_loss(batch, **shard)
  File "/home/cai/yym/ddl/final/OpenNMT-py-copy_constraint/onmt/modules/CopyGenerator.py", line 201, in _compute_loss
    batch.src_map)
  File "/home/cai/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cai/yym/ddl/final/OpenNMT-py-copy_constraint/onmt/modules/CopyGenerator.py", line 99, in forward
    mul_attn = torch.mul(attn, tags) * 2
RuntimeError: The size of tensor a (400) must match the size of tensor b (0) at non-singleton dimension 1

Question about generating the content-selection data

Hi, I have a question about how the data for the content-selection task is generated in your paper.
What do you mean by "its part of the LONGEST possible subsequence of tokens"? I also can't work out the meaning of "s = x_{i-j:i:i+k}". Could you please help me understand this generation process?

Preprocessing step using OPENNMT-py gives 0 features?

Hi,

My professor Filip G. talked to you at EMNLP and told me that I should try to reproduce the results of this paper (he said the procedure is easy to repeat). I got stuck on the first step. I would be glad if you could help me here.

As suggested in the first step at http://opennmt.net/OpenNMT-py/Summarization.html, I run the preprocessing (the same command line as for preprocessing the CNNDM dataset).

The log then reports 0 features:

[2018-11-08 13:25:06,615 INFO] Extracting features...
[2018-11-08 13:25:06,676 INFO] * number of source features: 0.
[2018-11-08 13:25:06,676 INFO] * number of target features: 0.
[2018-11-08 13:25:06,676 INFO] Building Fields object...
[2018-11-08 13:25:06,676 INFO] Building & saving training data...

After training finishes, I receive an error saying that the vocabulary size is 0.

How should I approach the problem and make sure that I do not get 0 features?

License

Hi Sebastian, I'd like to know which license applies to this repo. Could you please add one?
Thank you!

Trouble generating content selection training data

Hi,
As mentioned in the title, I would like to know how you tag the training data in the content selection step. I understand that you do it by aligning the summaries to the document, but I couldn't reproduce the training data on my own using the method described in the paper. I am also a bit confused about your labeled data. Here's one example:
The original document is:
'-lrb- cnn -rrb- relations between iran and saudi arabia have always been thorny , but rarely has the state of affairs been as venomous as it is today . tehran and riyadh each point to the other as the main reason for much of the turmoil in the middle east . in its most recent incarnation , the iranian-saudi conflict by proxy has reached yemen in a spiral that both sides portray as climatic . for riyadh and its regional allies , the saudi military intervention in yemen -- operation decisive storm '' -- is the moment the sunni arab nation finally woke up to repel the expansion of shia-iranian influence . for tehran and its regional allies -- including the houthi movement in yemen -- saudi arabia \'s actions are in defense of a retrogressive status quo order that is no longer tenable . and yet both sides have good reasons to want to stop the yemeni crisis from spiraling out of control and evolving into an unwinnable war . when iranian president hassan rouhani was elected in june 2013 , he pledged to reach out to riyadh . he was up front and called tehran \'s steep deterioration of relations with the saudis over the last decade as one of the principal burdens on iranian foreign policy . from lebanon and afghanistan to pakistan and the gaza strip , the iranian-saudi rivalry and conflict through proxy has been deep and costly . and yet despite rouhani \'s open pledge , profound differences over syria and iraq in particular have kept riyadh and tehran apart . but if the questions of syria and iraq prevented a pause in hostilities , the saudi military intervention in yemen since late march has all but raised the stakes to unprecedentedly dangerous levels . unlike in syria and in iraq , the saudi military is now directly battling it out with iranian-backed rebels in yemen . while riyadh no doubt exaggerates tehran \'s role in the yemen crisis , its fingerprints are nonetheless evident . 
iran provides financial support , weapons , training and intelligence to houthis , '' gerald feierstein , a u.s. state department official and former yemen ambassador , told a congressional hearing last week . `` we believe that iran sees opportunities with the houthis to expand its influence in yemen and threaten saudi and gulf arab'

and the ground-truth summary is:
' vatanka : tensions between iran and saudi arabia are at an unprecedented level . iran has proposed a four-point plan for yemen but saudis have ignored it . vatanka : saudis have tried to muster a ground invasion coalition but have failed . '

the tagged data provided in this repo is:
'between iran and saudi arabia have but has as it is . and point to for in yemen a saudi arab saudi arabia are no an saudis on iran we'

But you can see that the word 'we' is not in the summary, so why is it tagged 1? In any case, could you please provide the code that tags the training data? That would really help me a lot. Thanks!
