Trade-off between Informativeness, Redundancy, and Local Coherence in Extractive Summarization

We introduce two extractive summarization systems, TreeKvD and GraphKvD, based on the Micro-Macro Structure Theory of human reading comprehension (Kintsch & van Dijk, 1978; also known as 'KvD'). Both systems are equipped with control mechanisms that balance properties of the final summary such as informativeness, redundancy, and cohesion. This repository contains the code to replicate the results presented in our paper:

@misc{https://doi.org/10.48550/arxiv.2205.10192,
  doi = {10.48550/ARXIV.2205.10192},
  url = {https://arxiv.org/abs/2205.10192},
  author = {Cardenas, Ronald and Galle, Matthias and Cohen, Shay B.},
  title = {On the Trade-off between Redundancy and Local Coherence in Summarization},
  publisher = {arXiv},
  year = {2022}
}

Installation

Create a Python 3.7 environment and install the libraries listed in requirements.txt. Additionally, you will need the spaCy English model for evaluation:

python -m spacy download en_core_web_sm

Dataset files

We provide the preprocessed files for the arXiv and PubMed datasets here.
Each dataset split is a line-wise pickle file in which each sample has the following format:

{
	"id": str, # Original ID in the Cohan dataset
	"section_names": List[str], # List of section names
	"section_sents": List[List[str]], # List of sentences, grouped by section
	"section_ptrees": List[List[(int,Dict)]], # List of tuples (tree root, proposition tree) corresponding to each source sentence, grouped by section
	"doc_props": Dict{int:Proposition}, # Dictionary of propositions in the source document, mapping each id to its Proposition object
	"abs_sents": List[str], # List of sentences in the abstract
	"abs_ptrees": List[(int,Dict)], # List of tuples (tree root, proposition tree) corresponding to each target (abstract) sentence
	"abs_props": Dict{int:Proposition} # Dictionary of propositions in the abstract
}
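
The samples in a split can be read back with a short loop. The following is a minimal sketch, assuming that "line-wise" means the samples were written one after another with successive pickle.dump calls on the same file; the function name iter_samples is hypothetical and not part of this repository. Unpickling the real dataset files also requires the repository's Proposition class to be importable, since doc_props and abs_props store Proposition objects.

```python
import pickle
from typing import Any, Dict, Iterator


def iter_samples(path: str) -> Iterator[Dict[str, Any]]:
    """Yield samples from a file holding consecutively pickled objects.

    Assumption: each sample was appended with its own pickle.dump call.
    """
    with open(path, "rb") as handle:
        while True:
            try:
                yield pickle.load(handle)
            except EOFError:
                # No more pickled objects left in the file.
                return
```

Each yielded sample is a dictionary with the keys listed above, so sample["abs_sents"] gives the abstract sentences of one document.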

Preprocessing arXiv and PubMed from scratch

Download the Cohan dataset from here and unzip it.
The preprocessing scripts assume the following directory structure:

datasets/
	arxiv/
		train.jsonl
		valid.jsonl
		test.jsonl
	pubmed/
		train.jsonl
		valid.jsonl
		test.jsonl
redundancy-kvd/ (this repository)
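
Before starting the preprocessing, it can help to confirm that the layout matches. The following is a minimal sketch that checks for the six .jsonl files shown above; the function name missing_dataset_files is hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import List

# Dataset and split names taken from the directory structure above.
DATASETS = ("arxiv", "pubmed")
SPLITS = ("train", "valid", "test")


def missing_dataset_files(datasets_dir: str) -> List[str]:
    """Return the expected <dataset>/<split>.jsonl files that are absent."""
    root = Path(datasets_dir)
    missing = []
    for dataset in DATASETS:
        for split in SPLITS:
            candidate = root / dataset / f"{split}.jsonl"
            if not candidate.is_file():
                missing.append(str(candidate))
    return missing
```

An empty list returned by missing_dataset_files("datasets") means all six files are in place.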

You will need the UDPipe 1.2.0 dependency parser (here) and its pretrained model for English (here).

Then follow the steps in data_preprocessing/preprocessing_steps.sh.

Running the KvD Summarizers

The systems load their hyperparameter configuration from a JSON or CSV file in the config_files subfolder.
For instance, to run TreeKvD on the arXiv test set with the hyperparameters reported in the paper, use:

cd treekvd/
python run.py -d arxiv -s test -nj <num-cpus> --exp_id <experiment-name> --conf conf_files/recommended.json

The predictions will be saved in the folder treekvd/exps/<experiment-name>/arxiv-test/.
Use the same command from the graphkvd subfolder to run GraphKvD instead.
To run several configurations at once, add more rows to the CSV file or more elements to the corresponding list in the JSON file.
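
To launch both systems with the same settings, the command can be assembled once and run from each subfolder. The following is a minimal sketch: the flags are copied from the command above, while the helper names build_run_command and run_system are hypothetical and not part of this repository.

```python
import subprocess
from typing import List


def build_run_command(dataset: str, split: str, num_cpus: int,
                      exp_id: str, conf: str) -> List[str]:
    """Assemble the run.py invocation using the flags shown above."""
    return [
        "python", "run.py",
        "-d", dataset,
        "-s", split,
        "-nj", str(num_cpus),
        "--exp_id", exp_id,
        "--conf", conf,
    ]


def run_system(system_dir: str, command: List[str]) -> None:
    """Run the command from inside a system subfolder (treekvd or graphkvd)."""
    # check=True raises CalledProcessError if run.py exits with an error.
    subprocess.run(command, cwd=system_dir, check=True)
```

For example, calling run_system first with "treekvd" and then with "graphkvd", each time passing the result of build_run_command("arxiv", "test", 4, "my-experiment", "conf_files/recommended.json"), runs both systems in turn; the experiment name and CPU count here are placeholders.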

Evaluation

Move into the evaluation/ folder and run the following scripts according to the desired metric.

  • ROUGE scores
python run_srouge.py -d arxiv -s test -nj <num-cpus> --pred <prediction JSON file>
  • Redundancy metrics and candidate summary statistics such as summary length, coverage, and density.
python run_summeval.py -d arxiv -s test -nj <num-cpus> --pred <prediction JSON file>
  • BERTScore with SciBERT as the core model
python run_scibert_score.py -d arxiv -s test -nj <num-cpus> --pred <prediction JSON file>
  • Local coherence as language model perplexity.
python run_ppl_lcoh.py -d arxiv -s test --pred <prediction JSON file>
  • Aggregate all results into a CSV table
python report_evaluation.py -d arxiv -s test -nj <num-cpus> --pred_file <prediction JSON file> --output <CSV file name>

In all cases, the --pred argument can also be a folder name, in which case the evaluation runs on each .json file found inside.
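
The file-or-folder behavior described above can be pictured as follows. This is a minimal sketch for illustration, not the repository's implementation: the function name resolve_prediction_files is hypothetical, and it assumes that only the top level of the folder is searched and that files are processed in sorted order, neither of which is stated here.

```python
from pathlib import Path
from typing import List


def resolve_prediction_files(pred: str) -> List[Path]:
    """Expand a prediction path into the list of JSON files to evaluate."""
    path = Path(pred)
    if path.is_dir():
        # Assumption: only .json files directly inside the folder are used.
        return sorted(p for p in path.iterdir()
                      if p.is_file() and p.suffix == ".json")
    # A single prediction file is evaluated on its own.
    return [path]
```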
