achen353 / transformersum
BERT-based extractive summarizer for long legal documents using a divide-and-conquer approach
License: GNU General Public License v3.0
Use the built-in data prep functions in the original BillSum repo:
https://github.com/FiscalNote/BillSum/tree/master/billsum/data_prep
If possible, continue using convert_to_extractive.py, but add the data prep functions to it and use them when the specified dataset is billsum. Other datasets should still go through the same procedures as designed in TransformerSum.
LABELING FUNCTION: Verify the difference between the labeling functions in TransformerSum (https://github.com/achen353/TransformerSum/blob/master/src/convert_to_extractive.py) and BillSum (https://github.com/FiscalNote/BillSum/blob/master/billsum/data_prep/label_sentences.py). Adopt the BillSum one and make it an option in convert_to_extractive.py, then prepare the data with that labeling function.
DUE: 11/20 Saturday 11:59pm
Fix the Combination and Greedy strategies so that each section has at least one sentence labeled 1 (meaning it is classified as part of the extractive summary).
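One way to enforce the "at least one positive per section" rule is a post-processing step that runs after the existing strategy has produced its labels. The sketch below is only an illustration: `ensure_one_positive` and `unigram_f1` are hypothetical helpers, not functions from TransformerSum or BillSum, and the unigram F1 score is a stand-in for whatever ROUGE variant the labeling function actually uses.

```python
from collections import Counter
from typing import List


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 on lowercased whitespace tokens (a simplification)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def ensure_one_positive(
    sentences: List[str], labels: List[int], summary: str
) -> List[int]:
    """Return a copy of `labels` in which at least one sentence is labeled 1.

    Hypothetical helper: if the section already has a positive label (or has
    no sentences), the labels are returned unchanged. Otherwise the sentence
    that overlaps most with the section's target summary is switched to 1.
    """
    if not sentences or any(labels):
        return list(labels)
    scores = [unigram_f1(sentence, summary) for sentence in sentences]
    best = max(range(len(sentences)), key=scores.__getitem__)
    fixed = list(labels)
    fixed[best] = 1
    return fixed
```

Applied per section, this leaves sections that already contain a positive sentence untouched and only intervenes when the strategy selected nothing.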
PEGASUS (Google's abstractive-extractive mix summarizer, achieving many SOTA performances):
DANCER-summ (D&C approach on long text summarization):
Some of the language models are going to be humongous. Need to set up a Compute Engine instance on GCP (probably multi-GPU)
Useful References/Tutorials
Hopefully, the tutorials in these links are still valid (we used them more than a year ago). If anything does not apply or there are better tutorials, please let the others know.
TODO
TODO
Test out all the instructions on README.md before our GPU requests are approved by GCP
DANCER (PART 2): Modify the code so "given a document, the model would automatically break the document into parts based on and other similar tokens."
DUE: 11/20 Saturday 11:59pm
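A minimal sketch of the Part 2 splitting step, assuming the section boundaries are marked by literal tokens in the text. The function name `split_into_sections` and the marker strings `<SEC>` and `<SUB>` in the usage note are placeholders for illustration only; the actual tokens used by the dataset need to be passed in.

```python
import re
from typing import List, Sequence


def split_into_sections(document: str, markers: Sequence[str]) -> List[str]:
    """Split `document` into sections wherever any marker token occurs.

    `markers` are treated as literal strings (they are regex-escaped).
    Empty sections, such as the text before a leading marker, are dropped.
    """
    if not markers:
        stripped = document.strip()
        return [stripped] if stripped else []
    pattern = "|".join(re.escape(marker) for marker in markers)
    parts = re.split(pattern, document)
    return [part.strip() for part in parts if part.strip()]
```

For example, `split_into_sections("<SEC> Short title. <SEC> Findings.", ["<SEC>"])` would yield two sections, one per marker.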
TODO
Try out using the dataset generated by convert_to_extractive.py to train the model
DANCER (PART 1): Implement the ROUGE score calculation between "each sentence in the ground truth abstractive summary" and "each section". Map each sentence in the former to a section.
DUE: 11/20 Saturday 11:59pm
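The Part 1 mapping could look roughly like the sketch below: score every (summary sentence, section) pair and assign each sentence to its best-scoring section. This is an assumption-laden illustration, not the DANCER implementation. It uses a simple whitespace-token ROUGE-L F1 written from scratch, and the function names are hypothetical. A real run would more likely use a ROUGE library and the exact variant from the paper.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L-style F1 on lowercased whitespace tokens (a simplification)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def map_sentences_to_sections(
    summary_sentences: List[str], sections: List[str]
) -> Dict[int, List[int]]:
    """Map each summary sentence index to the section index it matches best.

    Returns {section_index: [summary_sentence_indices]}. Ties go to the
    earliest section.
    """
    if not sections:
        return {}
    mapping: Dict[int, List[int]] = {i: [] for i in range(len(sections))}
    for s_idx, sentence in enumerate(summary_sentences):
        scores = [rouge_l_f1(sentence, section) for section in sections]
        best = max(range(len(sections)), key=scores.__getitem__)
        mapping[best].append(s_idx)
    return mapping
```

The sentences collected under each section index then form that section's target summary for training.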
Convert the billsum HuggingFace dataset into an abstractive dataset
TODO
train split into train and validation (since billsum on HuggingFace only has train, test, ca_test splits)
\n and \t
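The train/validation split can be sketched with the standard library alone, as below. The helper name `train_validation_split`, the 10% validation fraction, and the seed are assumptions for illustration, not values decided in this issue.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_validation_split(
    examples: Sequence[T], validation_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Deterministically split `examples` into (train, validation) lists.

    The validation examples are chosen by a seeded shuffle of the indices;
    both output lists keep the original relative order of the examples.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_validation = round(len(examples) * validation_fraction)
    validation_indices = set(indices[:n_validation])
    train = [ex for i, ex in enumerate(examples) if i not in validation_indices]
    validation = [ex for i, ex in enumerate(examples) if i in validation_indices]
    return train, validation
```

If the data stays in a HuggingFace `datasets.Dataset`, its `train_test_split` method (with `test_size` and `seed`) should achieve the same thing without materializing Python lists.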
DANCER (PART 3): Create functions that calculate the final ROUGE score of a document by aggregating (1) the predicted section summaries and (2) the ground truth summaries, respectively, and comparing the two.
DUE: 11/20 Saturday 11:59pm
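For Part 3, one plausible reading is: concatenate the per-section predictions into one document-level summary, do the same for the per-section ground truth, and score the two strings against each other. The sketch below assumes that reading and uses a hand-rolled ROUGE-1 F1 on whitespace tokens. Both function names are hypothetical, and the final numbers should come from a proper ROUGE implementation.

```python
from collections import Counter
from typing import Sequence


def rouge_1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 on lowercased whitespace tokens (a simplification)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def document_rouge(
    predicted_section_summaries: Sequence[str],
    reference_section_summaries: Sequence[str],
) -> float:
    """Score a document by joining its section summaries on both sides.

    Sections are joined in the order given, so the caller is responsible
    for passing them in document order. Empty section summaries are skipped.
    """
    predicted = " ".join(s.strip() for s in predicted_section_summaries if s.strip())
    reference = " ".join(s.strip() for s in reference_section_summaries if s.strip())
    return rouge_1_f1(predicted, reference)
```

Scoring at the document level rather than averaging per-section scores keeps the result comparable with models that summarize the whole document in one pass.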
Given the divide and conquer approach of DANCER-summ, we will handle the summarization of a single document by section
Paper: https://arxiv.org/pdf/2004.06190.pdf
GitHub: https://github.com/AlexGidiotis/DANCER-summ
NOTE: Labeling Function is defined as the function that helps provide binary ground truth for all the sentences in the text for extractive summarization.
Related:
#8
In addition to the GCP account for this group project, request GPUs on your personal account to increase the chance of getting a GPU in time.
TODO
master
add-<issue #>-<short description> (e.g. add-1-abstractive-test-results). Other starting phrases are fine too (e.g. add, fix, refactor).