
transformersum's Issues

Fix Combination and Greedy strategies

Fix the Combination and Greedy strategies so that each section has at least one sentence labeled 1 (meaning it is classified as part of the extractive summary).
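A minimal sketch of the post-processing step this issue asks for, not the repository's actual implementation. It assumes each sentence already has a 0/1 label from the strategy, a relevance score (for example its ROUGE score against the reference summary), and a section id. The function name `ensure_one_per_section` and its parameters are hypothetical.

```python
from typing import Dict, List


def ensure_one_per_section(
    labels: List[int], scores: List[float], section_ids: List[int]
) -> List[int]:
    """Return a copy of `labels` in which every section has at least one 1.

    `scores` is an assumed per-sentence relevance score; when a section has
    no positive label, its highest-scoring sentence is promoted to 1.
    """
    fixed = list(labels)

    # Group sentence indices by the section they belong to.
    by_section: Dict[int, List[int]] = {}
    for idx, sec in enumerate(section_ids):
        by_section.setdefault(sec, []).append(idx)

    for indices in by_section.values():
        if not any(fixed[i] == 1 for i in indices):
            best = max(indices, key=lambda i: scores[i])
            fixed[best] = 1
    return fixed
```

Labels that the strategy already set to 1 are left untouched, so the fix only adds sentences and never removes any.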

Setup GCP Compute Engine

Some of the language models are going to be humongous. Need to set up a Compute Engine instance on GCP (probably multi-GPU)

Useful References/Tutorials

Hopefully the tutorials in these links are still valid (I used them more than a year ago). If anything no longer applies, or if better tutorials exist, please let the others know.

TODO

  • Request GPU quotas
  • Set up Compute Engine VM instance

Testing on Abstractive Summarizer

TODO

  • Understand how it works: how the data is transformed throughout the process, and how it differs from the Extractive Summarizer (ES)
  • Understand the architecture of the base models (LongformerEncoderDecoder)

DANCER (Part 2)

DANCER (PART 2): Modify the code so "given a document, the model would automatically break the document into parts based on and other similar tokens."

DUE: 11/20 Saturday 11:59pm
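One possible shape for the splitting step in Part 2, as a sketch only. The actual section tokens are not given in this issue, so the marker is passed in as a regular expression by the caller. The function name `split_into_sections` and the `[[SEC]]` marker in the usage note are hypothetical placeholders.

```python
import re
from typing import List


def split_into_sections(document: str, marker_pattern: str) -> List[str]:
    """Split `document` wherever `marker_pattern` matches.

    `marker_pattern` is a regular expression for the section tokens. Use
    non-capturing groups in it, because re.split also returns the text of
    any capturing groups.
    """
    parts = re.split(marker_pattern, document)
    # Drop empty pieces, e.g. when the document starts with a marker.
    return [part.strip() for part in parts if part.strip()]
```

For example, `split_into_sections("[[SEC]] a b [[SEC]] c", r"\[\[SEC\]\]")` yields two parts, `"a b"` and `"c"`.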

Calculate statistics of BillSum

  1. Calculate some statistics of the BillSum data:
    • Avg. length of document
    • Avg. length of ground truth document summary
    • Avg. length of predicted document summary
  2. Give a sample predicted summary
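The three averages listed above could be computed along these lines. This is a sketch under one assumption: length is measured in whitespace-separated tokens, since the issue does not say whether words, characters, or model tokens are intended. The function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def average_word_length(texts: List[str]) -> float:
    """Average number of whitespace-separated tokens per text."""
    return float(mean(len(text.split()) for text in texts))


def billsum_stats(
    documents: List[str], references: List[str], predictions: List[str]
) -> Dict[str, float]:
    """Collect the three averages requested in this issue."""
    return {
        "avg_document_len": average_word_length(documents),
        "avg_reference_len": average_word_length(references),
        "avg_prediction_len": average_word_length(predictions),
    }
```

Swapping `text.split()` for a model tokenizer would give token counts instead, which may be the more useful number when checking inputs against a model's maximum length.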

Replace the main README.md

TODO

  • Rename the original README.md to another file name (e.g. README.orig.md)
  • Make our own README.md that includes a brief overview of the project (similar to the structure of the final report)

DANCER (Part 1)

DANCER (PART 1): Implement the ROUGE score calculation between "each sentence in the ground truth abstractive summary" and "each section". Map each sentence in the former to a section.

DUE: 11/20 Saturday 11:59pm
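A rough sketch of the mapping described in Part 1. It uses a simplified unigram-overlap ROUGE-1 F1 written from scratch, which is an assumption: the issue does not specify which ROUGE variant to use, and a proper ROUGE package would add stemming and better tokenization. The function names are hypothetical.

```python
from collections import Counter
from typing import List


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def map_sentences_to_sections(
    summary_sentences: List[str], sections: List[str]
) -> List[int]:
    """For each summary sentence, return the index of its best-scoring section."""
    mapping = []
    for sentence in summary_sentences:
        scores = [rouge1_f1(sentence, section) for section in sections]
        # Ties resolve to the earliest section.
        mapping.append(max(range(len(sections)), key=scores.__getitem__))
    return mapping
```

The returned list has one entry per summary sentence, so grouping the sentences by that index gives a per-section target summary.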

Modify convert_to_extractive.py

Convert the billsum HuggingFace dataset into an abstractive dataset

TODO

  • Fix any bugs encountered
  • Add code to split train split into train and validation (since billsum on Huggingface only has train, test, ca_test splits)
  • Check if we need to remove extra white spaces and escaped characters like \n and \t
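The split and cleanup items above might look roughly like this, using only the standard library. The 10% validation fraction, the seed, and both function names are assumptions made for illustration. The whitespace cleanup should only be applied if the check in the last item shows it is needed.

```python
import random
import re
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def split_train_validation(
    examples: List[T], validation_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically and carve a validation set out of train."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def normalize_whitespace(text: str) -> str:
    """Replace literal escaped \\n, \\t, \\r with spaces and collapse whitespace."""
    text = re.sub(r"\\[ntr]", " ", text)  # backslash followed by n, t, or r
    return re.sub(r"\s+", " ", text).strip()
```

When working with a Hugging Face `datasets.Dataset` directly, its built-in `train_test_split` method covers the same need as the first function.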

DANCER (Part 3)

DANCER (PART 3): Create functions that would calculate the final ROUGE score of a document by aggregating (1) the predicted section summary and (2) the ground truth summary, respectively, and compare.

DUE: 11/20 Saturday 11:59pm
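A sketch of the aggregation in Part 3, assuming the per-section summaries are simply concatenated in section order on both sides before scoring. It uses a small from-scratch ROUGE-L F1 based on the longest common subsequence. The choice of ROUGE-L and all function names are assumptions, not something the issue specifies.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 on lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def document_rouge(
    predicted_sections: List[str], reference_sections: List[str]
) -> float:
    """Join the section-level summaries on each side, then score once."""
    predicted = " ".join(s.strip() for s in predicted_sections if s.strip())
    reference = " ".join(s.strip() for s in reference_sections if s.strip())
    return rouge_l_f1(predicted, reference)
```

Scoring the joined text once keeps the result comparable to a document-level ROUGE score, which averaging per-section scores would not.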

Preprocess BillSum Dataset

TODO

  • Understand and compare how the CNN, Arxiv/PubMed datasets are preprocessed for TransformerSum
  • Devise a concatenation scheme for the BillSum Dataset

Note: How to Use this Repo

  1. NEVER commit directly to master
  2. Open an issue for any tasks you're working on or problem you're trying to solve
  3. Make PRs for any code change for any corresponding issues you are assigned to
  4. There are no rules on issue/PR naming, but branches should preferably be named like add-<issue #>-<short description> (e.g. add-1-abstractive-test-results). Other starting verbs are fine too (e.g. fix, refactor)
  5. COMMENT your code, especially when there is complicated logic to follow
  6. For code, follow general Python convention and, when possible, use formatters like black and isort.
