achen353 / transformersum

BERT-based extractive summarizer for long legal documents using a divide-and-conquer approach

License: GNU General Public License v3.0

Python 100.00%

legal-documents extractive-text-summarization extractive-summarization bert-model divide-and-conquer-approach divide-and-conquer

transformersum's Introduction

👋 Hi! My name is Andrew.

Here are some facts about me:

💡 I'm a software engineer working in the San Francisco Bay Area.
🎓 I graduated from Georgia Tech specializing in machine learning. Happy to connect with aspiring/current Yellow Jackets interested in the tech industry!
🌱 I'm all up for building robust and cost-efficient ML models and delivering trustworthy AI solutions to production.
⛹️‍♂️ In my free time, I enjoy working out, playing basketball and learning new things.
✉️ Feel free to reach out to me on LinkedIn.

transformersum's Issues

DANCER (Part 1)

DANCER (PART 1): Implement the ROUGE score calculation between "each sentence in the ground truth abstractive summary" and "each section". Map each sentence in the former to a section.

DUE: 11/20 Saturday 11:59pm
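
The mapping described in this issue can be sketched as follows. This is a minimal illustration, not the DANCER implementation: it uses a hand-rolled unigram-overlap F1 as a stand-in for a full ROUGE package, and the function names `rouge1_f` and `map_sentences_to_sections` are hypothetical.

```python
from collections import Counter


def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def map_sentences_to_sections(summary_sentences, sections):
    """For each summary sentence, return the index of its best-scoring section."""
    mapping = []
    for sentence in summary_sentences:
        scores = [rouge1_f(sentence, section) for section in sections]
        # On ties, the earliest section wins.
        mapping.append(max(range(len(sections)), key=scores.__getitem__))
    return mapping


# Toy data, made up for illustration only.
sections = [
    "this act may be cited as the clean water act",
    "the secretary shall award grants to eligible states",
]
summary = [
    "the bill is cited as the clean water act",
    "grants are awarded to states by the secretary",
]
print(map_sentences_to_sections(summary, sections))  # [0, 1]
```

Swapping `rouge1_f` for a ROUGE-L or ROUGE-2 scorer from an established package would leave the mapping logic unchanged.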

Use the built-in data prep functions in the original BillSum repo

https://github.com/FiscalNote/BillSum/tree/master/billsum/data_prep

If possible, continue using convert_to_extractive.py, but add the data prep functions to it and use them when the specified dataset is billsum. Other datasets would still go through the same procedures as designed in TransformerSum.

Modify convert_to_extractive.py

Convert the billsum HuggingFace dataset into an abstractive dataset

TODO

Fix any bug encountered
Add code to split train split into train and validation (since billsum on Huggingface only has train, test, ca_test splits)
Check if we need to remove extra white spaces and escaped characters like \n and \t
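
The last two TODO items could look roughly like the sketch below. It works on a plain list of records and assumes a 10% validation share and a fixed seed, neither of which is specified in the issue; the helper names are hypothetical. When working directly with the HuggingFace dataset object, `Dataset.train_test_split` from the `datasets` library may be the more natural choice.

```python
import random
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (spaces, newlines, tabs) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_train_validation(records, validation_fraction=0.1, seed=42):
    """Shuffle a copy of `records` and split off a validation portion.

    Returns (train, validation). The fraction and seed are assumptions.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Toy records, made up for illustration only.
records = [{"text": f"SECTION {i}.\n\tShort  title."} for i in range(20)]
cleaned = [{"text": normalize_whitespace(r["text"])} for r in records]
train, validation = split_train_validation(cleaned)
print(len(train), len(validation))  # 18 2
```

Note that `normalize_whitespace` handles real whitespace characters; if the text contains literal backslash sequences instead, those would need a separate replacement step.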

Testing on Abstractive Summarizer

TODO

Understand how it works: how data are transformed throughout the process, and how it differs from the Extractive Summarizer (ES)
Understand the architecture of the base models (LongformerEncoderDecoder)

Replace the main README.md

TODO

Change the original README.md to another file name (e.g. README.orig.md)
Make our own README.md that includes a brief overview of the project (similar to the structure of the final report)

Note: How to Use this Repo

NEVER commit directly to master
Open an issue for any tasks you're working on or problem you're trying to solve
Make PRs for any code change for any corresponding issues you are assigned to
No rules on Issue/PR naming, but preferably name your branches like add-<issue #>-<short description> (e.g. add-1-abstractive-test-results). Any starting verb is fine (e.g. add, fix, refactor)
COMMENT your code, especially when there's complicated logic to follow
For code, follow general Python conventions and, when possible, use formatters like black and isort.

Calculate statistics of BillSum

Calculate some statistics of the BillSum data:
- Avg. length of document
- Avg. length of ground truth document summary
- Avg. length of predicted document summary
- Give a sample predicted summary
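
A minimal sketch of these statistics, assuming "length" means the number of whitespace-separated words (the issue does not say whether words, tokens, or characters are intended). The function names and dictionary keys are hypothetical.

```python
from statistics import mean


def average_word_count(texts):
    """Mean number of whitespace-separated words per text."""
    return mean(len(text.split()) for text in texts)


def billsum_length_stats(documents, reference_summaries, predicted_summaries):
    """Average lengths for documents, ground truth summaries, and predictions."""
    return {
        "avg_document_length": average_word_count(documents),
        "avg_reference_summary_length": average_word_count(reference_summaries),
        "avg_predicted_summary_length": average_word_count(predicted_summaries),
    }


# Toy inputs, made up for illustration only.
stats = billsum_length_stats(
    documents=["a b c d", "a b"],
    reference_summaries=["a b"],
    predicted_summaries=["a"],
)
print(stats)
```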

Labeling Function

LABELING FUNCTION: Verify the Labeling Function difference between TransformerSum (https://github.com/achen353/TransformerSum/blob/master/src/convert_to_extractive.py) and BillSum (https://github.com/FiscalNote/BillSum/blob/master/billsum/data_prep/label_sentences.py), adopt BillSum and make it an option in convert_to_extractive.py. Prepare the data with the labeling function

DUE: 11/20 Saturday 11:59pm
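
For orientation, one common way to build binary sentence labels is greedy selection: keep adding the sentence that most improves the overlap between the selected set and the reference summary, and stop when nothing improves it. The sketch below illustrates that idea only. It is not copied from TransformerSum or BillSum, it uses unigram F1 in place of a real ROUGE scorer, and `greedy_labels` and `max_selected` are hypothetical names.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1 between two token lists."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_labels(sentences, summary, max_selected=3):
    """Return one 0/1 label per sentence via greedy overlap maximization."""
    reference = summary.lower().split()
    tokenized = [s.lower().split() for s in sentences]
    selected, best_score = [], 0.0
    while len(selected) < max_selected:
        best_index = None
        for i, tokens in enumerate(tokenized):
            if i in selected:
                continue
            # Score the already-selected sentences plus this candidate.
            candidate = [t for j in selected + [i] for t in tokenized[j]]
            score = unigram_f1(candidate, reference)
            if score > best_score:
                best_score, best_index = score, i
        if best_index is None:  # no remaining sentence improves the score
            break
        selected.append(best_index)
    return [1 if i in selected else 0 for i in range(len(sentences))]


# Toy data, made up for illustration only.
sentences = [
    "the act funds clean water projects",
    "the committee met on tuesday",
    "states receive grants for water projects",
]
summary = "the act funds water projects and gives grants to states"
print(greedy_labels(sentences, summary))  # [1, 0, 1]
```

Comparing the labels this produces against the output of each repository's own labeling function on the same bill is one way to verify how the two differ.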

DANCER (Part 2)

DANCER (PART 2): Modify the code so "given a document, the model would automatically break the document into parts based on and other similar tokens."

DUE: 11/20 Saturday 11:59pm
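
A rough sketch of the splitting step is shown below. The token named in the issue text is not available here, so `<SECTION>` is a purely hypothetical placeholder, as is the function name `split_into_sections`.

```python
import re

# Hypothetical marker; replace with the actual token(s) used in the data.
SECTION_MARKER = "<SECTION>"


def split_into_sections(document: str, markers=(SECTION_MARKER,)):
    """Split `document` at each marker and drop empty parts."""
    pattern = "|".join(re.escape(marker) for marker in markers)
    parts = re.split(pattern, document)
    return [part.strip() for part in parts if part.strip()]


# Toy document, made up for illustration only.
document = "<SECTION> Short title. <SECTION> Definitions. <SECTION>"
print(split_into_sections(document))  # ['Short title.', 'Definitions.']
```

Passing several markers in `markers` covers the "other similar tokens" case, since each one is escaped and joined into a single alternation pattern.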

Setup GCP on personal accounts and request GPUs

Related:
#8

In addition to the GCP account for this group project, request GPUs on your personal account to increase the chance of getting a GPU in time.

Setup GCP Compute Engine

Some of the language models are going to be humongous. We need to set up a Compute Engine instance on GCP (probably multi-GPU).

Useful Reference/Tutorial

Hopefully, the tutorials in these links are still valid (I used them more than a year ago). If anything does not apply or there are better tutorials, please let the others know.

TODO

Request GPU quotas
Set up Compute Engine VM instance

Test instructions on README.md

Test out all the instructions in README.md before our GPU requests are approved by GCP.

Test Training with Extractive BillSum (TransformerSum style)

Try using the dataset generated by convert_to_extractive.py to train the model.

Preprocess BillSum Dataset

TODO

Understand and compare how the CNN and arXiv/PubMed datasets are preprocessed for TransformerSum
Devise a concatenation scheme for the BillSum Dataset

Integrate DANCER-summ

Overview

Given the divide-and-conquer approach of DANCER-summ, we will handle the summarization of a single document section by section.
Paper: https://arxiv.org/pdf/2004.06190.pdf
GitHub: https://github.com/AlexGidiotis/DANCER-summ

TODO

NOTE: Labeling Function is defined as the function that helps provide binary ground truth for all the sentences in the text for extractive summarization.

PEGASUS:

Paper: https://arxiv.org/pdf/1912.08777.pdf
GitHub: https://github.com/google-research/pegasus
Huggingface API: https://huggingface.co/transformers/model_doc/pegasus.html

DANCER-summ (D&C approach on long text summarization):

Paper: https://arxiv.org/pdf/2004.06190.pdf
GitHub: https://github.com/AlexGidiotis/DANCER-summ
