transformersum's Issues

Use the built-in data prep functions in the original BillSum repo

Use the built-in data prep functions in the original BillSum repo.

https://github.com/FiscalNote/BillSum/tree/master/billsum/data_prep

If possible, continue using convert_to_extractive.py, but add the BillSum data prep functions to it and use them when the specified dataset is billsum. Other datasets would still go through the same procedures as designed in TransformerSum.
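One way to keep a single entry point is a small dispatch table keyed by dataset name. The sketch below is illustrative only: `prep_default`, `prep_billsum`, `PREP_FUNCTIONS`, and `prepare` are hypothetical names, not functions from TransformerSum or BillSum, and the BillSum branch merely stands in for whatever the billsum/data_prep code would do.

```python
import re
from typing import Callable, Dict


def prep_default(text: str) -> str:
    """Placeholder for the existing TransformerSum procedure (assumed no-op here)."""
    return text


def prep_billsum(text: str) -> str:
    """Placeholder for BillSum-specific cleaning; here it only collapses whitespace."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical registry: dataset name -> preprocessing function.
PREP_FUNCTIONS: Dict[str, Callable[[str], str]] = {"billsum": prep_billsum}


def prepare(text: str, dataset: str) -> str:
    """Apply dataset-specific prep when one is registered, else the default path."""
    return PREP_FUNCTIONS.get(dataset, prep_default)(text)
```

With this layout, adding another dataset-specific routine would only require a new entry in the registry, and every other dataset keeps the default behaviour.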

Labeling Function

LABELING FUNCTION: Verify the difference between the labeling functions of TransformerSum (https://github.com/achen353/TransformerSum/blob/master/src/convert_to_extractive.py) and BillSum (https://github.com/FiscalNote/BillSum/blob/master/billsum/data_prep/label_sentences.py), adopt the BillSum one, and make it an option in convert_to_extractive.py. Then prepare the data with that labeling function.

DUE: 11/20 Saturday 11:59pm

Fix Combination and Greedy strategies

Fix the Combination and Greedy strategies so that each section has at least one sentence labeled 1 (meaning it is classified as part of the extractive summary).
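A possible post-processing step for this fix is sketched below. It assumes that each sentence already has a label, a relevance score (for example its ROUGE score against the summary), and a section id; the function name and arguments are hypothetical and do not reflect how the existing strategies are implemented.

```python
from typing import List, Sequence


def ensure_one_per_section(
    labels: Sequence[int],
    scores: Sequence[float],
    section_ids: Sequence[int],
) -> List[int]:
    """Return a copy of `labels` in which every section has at least one 1.

    For a section without any positive label, the sentence with the highest
    score in that section is switched to 1.
    """
    if not (len(labels) == len(scores) == len(section_ids)):
        raise ValueError("labels, scores and section_ids must have equal length")

    fixed = list(labels)
    best_index = {}       # section id -> index of its highest-scoring sentence
    has_positive = set()  # section ids that already contain a 1

    for i, (label, score, sec) in enumerate(zip(labels, scores, section_ids)):
        if label == 1:
            has_positive.add(sec)
        if sec not in best_index or score > scores[best_index[sec]]:
            best_index[sec] = i

    for sec, i in best_index.items():
        if sec not in has_positive:
            fixed[i] = 1
    return fixed
```

For example, with labels `[1, 0, 0, 0, 0]`, scores `[0.9, 0.1, 0.2, 0.5, 0.3]`, and section ids `[0, 0, 1, 1, 1]`, the second section has no positive label, so its best sentence (index 3) is switched on.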

Reference (constantly updated)

PEGASUS (Google's abstractive summarizer, pre-trained with a gap-sentence generation objective; reported SOTA results on many summarization datasets):

Paper: https://arxiv.org/pdf/1912.08777.pdf
GitHub: https://github.com/google-research/pegasus
Huggingface API: https://huggingface.co/transformers/model_doc/pegasus.html

DANCER-summ (D&C approach on long text summarization):

Paper: https://arxiv.org/pdf/2004.06190.pdf
GitHub: https://github.com/AlexGidiotis/DANCER-summ

Setup GCP Compute Engine

Some of the language models are going to be very large, so we need to set up a Compute Engine instance on GCP (probably multi-GPU).

Useful Reference/Tutorial

Hopefully, the tutorials in these links are still valid (I used them more than a year ago). If anything does not apply or there are better tutorials, please let the others know.

TODO

Request GPU quotas
Set up Compute Engine VM instance

Testing on Abstractive Summarizer

TODO

Understand how it works: how the data is transformed throughout the process, and how it differs from the Extractive Summarizer (ES)
Understand the architecture of the base models (LongformerEncoderDecoder)

Test instructions on README.md

Test out all the instructions on README.md before our GPU requests are approved by GCP

DANCER (Part 2)

DANCER (PART 2): Modify the code so "given a document, the model would automatically break the document into parts based on and other similar tokens."
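A minimal sketch of such a splitting step follows. The concrete section tokens are left as a parameter because they are not spelled out here; `split_into_sections` is a hypothetical helper, and the `[SEC]` and `[SUBSEC]` strings in the usage note are placeholders only.

```python
import re
from typing import List, Sequence


def split_into_sections(document: str, markers: Sequence[str]) -> List[str]:
    """Split `document` at every occurrence of any marker, dropping empty parts."""
    if not markers or any(not m for m in markers):
        raise ValueError("markers must be a non-empty sequence of non-empty strings")
    # Longest markers first so a marker that prefixes another cannot shadow it.
    ordered = sorted(markers, key=len, reverse=True)
    pattern = "|".join(re.escape(m) for m in ordered)
    return [part.strip() for part in re.split(pattern, document) if part.strip()]
```

Calling `split_into_sections("[SEC] a b [SUBSEC] c", ["[SEC]", "[SUBSEC]"])` would yield the two parts `"a b"` and `"c"`, each of which could then be summarized separately.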

DUE: 11/20 Saturday 11:59pm

Calculate statistics of BillSum

Calculate some statistics of the BillSum data:
- Avg. length of document
- Avg. length of ground truth document summary
- Avg. length of predicted document summary
Also give a sample predicted summary.
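The averages above could be computed along the following lines. This sketch assumes that length means the number of whitespace-separated tokens, which is only one possible definition (characters or model tokens are alternatives); the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def average_length(texts: List[str]) -> float:
    """Average number of whitespace-separated tokens per text."""
    if not texts:
        raise ValueError("texts must not be empty")
    return float(mean(len(text.split()) for text in texts))


def length_statistics(
    documents: List[str], references: List[str], predictions: List[str]
) -> Dict[str, float]:
    """Collect the three average lengths in one dictionary."""
    return {
        "avg_document_length": average_length(documents),
        "avg_reference_summary_length": average_length(references),
        "avg_predicted_summary_length": average_length(predictions),
    }
```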

Replace the main README.md

TODO

Change the original README.md to another file name (e.g. README.orig.md)
Make our own README.md that includes a brief overview of the project (similar to the structure of the final report)

Test Training with Extractive BillSum (TransformerSum style)

Try out using the dataset generated by convert_to_extractive.py to train the model

DANCER (Part 1)

DANCER (PART 1): Implement the ROUGE score calculation between "each sentence in the ground truth abstractive summary" and "each section". Map each sentence in the former to a section.
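The mapping step might look like the sketch below. It uses a simplified ROUGE-1 F1 over lowercased whitespace tokens as a stand-in for a full ROUGE implementation, and the matching criterion in the DANCER paper may differ; `rouge1_f1` and `map_sentences_to_sections` are hypothetical names.

```python
from collections import Counter
from typing import List


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (lowercased, whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def map_sentences_to_sections(
    summary_sentences: List[str], sections: List[str]
) -> List[int]:
    """For each summary sentence, return the index of its best-scoring section."""
    if not sections:
        raise ValueError("sections must not be empty")
    mapping = []
    for sentence in summary_sentences:
        scores = [rouge1_f1(sentence, section) for section in sections]
        # Ties resolve to the earliest section.
        mapping.append(max(range(len(sections)), key=scores.__getitem__))
    return mapping
```

The returned list has one section index per summary sentence, so grouping sentences by index gives a per-section target summary.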

DUE: 11/20 Saturday 11:59pm

Modify convert_to_extractive.py

Convert the billsum Hugging Face dataset (an abstractive dataset) into an extractive dataset

TODO

Fix any bug encountered
Add code to split the train split into train and validation (since billsum on Hugging Face only has the train, test, and ca_test splits)
Check if we need to remove extra white spaces and escaped characters like \n and \t
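The last two items could be prototyped as below. This is a plain-Python sketch with hypothetical helper names; it assumes the examples fit in a list and that escaped characters appear as literal backslash sequences. The Hugging Face `datasets` library also offers `Dataset.train_test_split`, which may be the more natural choice inside convert_to_extractive.py.

```python
import random
import re
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def clean_text(text: str) -> str:
    """Replace literal "\\n" / "\\t" sequences and whitespace runs with one space."""
    text = text.replace("\\n", " ").replace("\\t", " ")
    return re.sub(r"\s+", " ", text).strip()


def split_train_validation(
    examples: Sequence[T], validation_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically and return (train, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    if len(examples) < 2:
        raise ValueError("need at least two examples to split")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Fixing the seed keeps the validation set identical across runs, which makes results from different experiments comparable.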

DANCER (Part 3)

DANCER (PART 3): Create functions that calculate the final ROUGE score of a document by aggregating (1) the predicted section summaries and (2) the ground truth summaries, respectively, and then comparing the two aggregated texts.
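A rough sketch of the aggregation follows. It assumes that aggregating means joining the section-level texts in order, and it scores them with a simplified ROUGE-L F1 based on the longest common subsequence of whitespace tokens; an established ROUGE package would be preferable for reported numbers. All function names are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 over lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def document_rouge(
    predicted_sections: List[str], reference_sections: List[str]
) -> float:
    """Join section-level summaries in order and score the document-level texts."""
    return rouge_l_f1(" ".join(predicted_sections), " ".join(reference_sections))
```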

DUE: 11/20 Saturday 11:59pm

Integrate DANCER-summ

Overview

Given the divide-and-conquer approach of DANCER-summ, we will handle the summarization of a single document section by section.
Paper: https://arxiv.org/pdf/2004.06190.pdf
GitHub: https://github.com/AlexGidiotis/DANCER-summ

TODO

NOTE: Labeling Function is defined as the function that helps provide binary ground truth for all the sentences in the text for extractive summarization.
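To make the definition concrete, here is a simplified greedy labeling function. It is a stand-in for illustration, not the TransformerSum or BillSum implementation: it uses unigram coverage of the summary instead of ROUGE, and `greedy_labels` and `max_selected` are hypothetical names.

```python
from collections import Counter
from typing import List


def greedy_labels(sentences: List[str], summary: str, max_selected: int = 3) -> List[int]:
    """Greedily mark the sentences that add the most new summary-unigram coverage."""
    remaining = Counter(summary.lower().split())  # summary tokens not yet covered
    labels = [0] * len(sentences)
    for _ in range(max_selected):
        best_gain, best_idx = 0, -1
        for i, sentence in enumerate(sentences):
            if labels[i]:
                continue
            gain = sum((Counter(sentence.lower().split()) & remaining).values())
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx < 0:  # no remaining sentence adds coverage
            break
        labels[best_idx] = 1
        remaining -= Counter(sentences[best_idx].lower().split())
    return labels
```

The output is one 0/1 label per sentence, which is the binary ground truth an extractive model is trained against.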

Setup GCP on personal accounts and request GPUs

Related:
#8

In addition to the GCP account for this group project, request GPUs on your personal account to increase the chance of getting a GPU in time.

Preprocess BillSum Dataset

TODO

Understand and compare how the CNN and arXiv/PubMed datasets are preprocessed for TransformerSum
Devise a concatenation scheme for the BillSum Dataset

Note: How to Use this Repo

NEVER commit directly to master
Open an issue for any tasks you're working on or problem you're trying to solve
Make PRs for any code change for any corresponding issues you are assigned to
There are no rules on issue/PR naming, but preferably name your branches like add-<issue #>-<short description> (e.g. add-1-abstractive-test-results). Other starting phrases are fine too (e.g. fix, refactor).
COMMENT your code, especially when there is complicated logic to follow
For code, follow general Python conventions and, when possible, use formatters like black and isort.
