transformersum's Introduction

👋   Hi! My name is Andrew.

Here are some facts about me:

  • 💡   I'm a software engineer working in the San Francisco Bay Area.
  • 🎓   I graduated from Georgia Tech specializing in machine learning. Happy to connect with aspiring/current Yellow Jackets interested in the tech industry!
  • 🌱   I'm all for building robust and cost-efficient ML models and delivering trustworthy AI solutions to production.
  • ⛹️‍♂️   In my free time, I enjoy working out, playing basketball and learning new things.
  • ✉️   Feel free to reach out to me on LinkedIn.

🛠   Tech Stack

Python  SQL  Java  C  JavaScript  HTML  CSS  Git  GitHub  PyCharm 

NumPy  SciPy  Pandas  Scikit-learn  OpenCV  NLTK  spaCy  Transformers  PyTorch  TensorFlow  Keras 

Django  MongoDB  FastAPI  Docker  Kubernetes  Google Cloud Platform  Tableau 

transformersum's People

Contributors

achen353, deepsource-autofix[bot], hhousen, imgbotapp, stephanieeechang

transformersum's Issues

DANCER (Part 1)

DANCER (PART 1): Implement the ROUGE score calculation between "each sentence in the ground truth abstractive summary" and "each section". Map each sentence in the former to a section.

DUE: 11/20 Saturday 11:59pm
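
A minimal sketch of the sentence-to-section mapping described above. Which ROUGE variant to use is not stated in this issue, so this example assumes ROUGE-L precision (the fraction of a summary sentence covered by its longest common subsequence with a section), computed on simple lowercased word tokens. The function names are hypothetical and not part of the repo; a proper ROUGE library would be a better choice for real experiments.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercased word tokens; a simplification of real ROUGE tokenization."""
    return re.findall(r"\w+", text.lower())


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_precision(sentence: str, section: str) -> float:
    """Assumed metric: share of the sentence's tokens covered by the LCS."""
    sent_tokens, sec_tokens = tokenize(sentence), tokenize(section)
    if not sent_tokens or not sec_tokens:
        return 0.0
    return lcs_length(sent_tokens, sec_tokens) / len(sent_tokens)


def map_sentences_to_sections(
    summary_sentences: List[str], sections: List[str]
) -> Dict[int, List[str]]:
    """Assign each summary sentence to the section index it scores highest against."""
    if not sections:
        raise ValueError("at least one section is required")
    mapping: Dict[int, List[str]] = {i: [] for i in range(len(sections))}
    for sentence in summary_sentences:
        scores = [rouge_l_precision(sentence, section) for section in sections]
        best = max(range(len(sections)), key=scores.__getitem__)  # ties go to the first section
        mapping[best].append(sentence)
    return mapping
```

The returned dictionary gives, for each section index, the summary sentences that can serve as that section's target summary. Sections that receive no sentence keep an empty list.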

Modify convert_to_extractive.py

Convert the billsum HuggingFace dataset into an abstractive dataset

TODO

  • Fix any bugs encountered
  • Add code to split the train split into train and validation sets (BillSum on Hugging Face only has train, test, and ca_test splits)
  • Check whether we need to remove extra whitespace and escaped characters like \n and \t
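
The last two TODO items could be sketched roughly as follows. This is a plain-Python illustration, not the code in convert_to_extractive.py: the function names, the 10% validation fraction, and the seed are assumptions. If the data is loaded with the Hugging Face `datasets` library, its `Dataset.train_test_split` method covers the same need directly.

```python
import random
import re
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def clean_text(text: str) -> str:
    """Replace literal "\\n", "\\t", "\\r" escapes with spaces, then collapse whitespace."""
    text = re.sub(r"\\[ntr]", " ", text)  # escaped characters stored as backslash + letter
    return re.sub(r"\s+", " ", text).strip()  # real newlines, tabs, repeated spaces


def split_train_validation(
    records: Sequence[T], validation_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Deterministically carve a validation set out of the train split."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_validation = int(len(records) * validation_fraction)
    validation_indices = set(indices[:n_validation])
    train = [r for i, r in enumerate(records) if i not in validation_indices]
    validation = [r for i, r in enumerate(records) if i in validation_indices]
    return train, validation
```

Fixing the seed keeps the validation set identical across runs, which matters when comparing models trained at different times.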

Testing on Abstractive Summarizer

TODO

  • Understand how it works: how the data are transformed throughout the process, and how it differs from the Extractive Summarizer (ES)
  • Understand the architecture of the base models (LongformerEncoderDecoder)

Replace the main README.md

TODO

  • Change the original README.md to another file name (e.g. README.orig.md)
  • Make our own README.md that includes a brief overview of the project (similar to the structure of the final report)

Note: How to Use this Repo

  1. NEVER commit directly to master
  2. Open an issue for any tasks you're working on or problem you're trying to solve
  3. Make PRs for any code change for any corresponding issues you are assigned to
  4. There are no rules on issue/PR naming, but branches should preferably be named like add-<issue #>-<short description> (e.g. add-1-abstractive-test-results). Other starting words are fine too (e.g. add, fix, refactor)
  5. COMMENT your code, especially when there's complicated logic to follow
  6. For code, follow general Python convention and, when possible, use formatters like black and isort.

Calculate statistics of BillSum

  1. Calculate some statistics of the BillSum data:
    • Avg. length of document
    • Avg. length of ground truth document summary
    • Avg. length of predicted document summary
  2. Give a sample predicted summary
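
One possible way to compute these averages, assuming "length" means the number of whitespace-separated words (the issue does not say whether words, characters, or model tokens are intended). The function and key names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def average_word_count(texts: List[str]) -> float:
    """Mean number of whitespace-separated words per text; 0.0 for an empty list."""
    if not texts:
        return 0.0
    return float(mean(len(text.split()) for text in texts))


def length_statistics(
    documents: List[str], references: List[str], predictions: List[str]
) -> Dict[str, float]:
    """Average lengths of documents, ground truth summaries, and predicted summaries."""
    return {
        "avg_document_length": average_word_count(documents),
        "avg_reference_summary_length": average_word_count(references),
        "avg_predicted_summary_length": average_word_count(predictions),
    }
```

Swapping `text.split()` for a model tokenizer would give lengths in tokens instead, which is more relevant when checking against a model's maximum input size.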

DANCER (Part 2)

DANCER (PART 2): Modify the code so "given a document, the model would automatically break the document into parts based on and other similar tokens."

DUE: 11/20 Saturday 11:59pm
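
A rough sketch of the splitting step. The actual delimiter tokens are not named in the task text, so this example takes them as a parameter; the `<SEC>` and `<SUBSEC>` strings in the usage lines are made-up placeholders, and `split_into_sections` is a hypothetical name.

```python
import re
from typing import List, Sequence


def split_into_sections(document: str, delimiters: Sequence[str]) -> List[str]:
    """Split a document on any of the given delimiter tokens, dropping empty parts."""
    if not delimiters:
        stripped = document.strip()
        return [stripped] if stripped else []
    # Longest delimiters first so a token that contains another one wins the match.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(token) for token in ordered)
    parts = re.split(pattern, document)
    return [part.strip() for part in parts if part.strip()]


# Placeholder delimiters, for illustration only.
example = "Intro text <SEC> Section one. <SUBSEC> Section two."
sections = split_into_sections(example, ["<SEC>", "<SUBSEC>"])
```

Each returned part can then be summarized independently and the partial summaries combined afterwards.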

Setup GCP Compute Engine

Some of the language models are going to be humongous. Need to set up a Compute Engine instance on GCP (probably multi-GPU)

Useful References/Tutorials

Hopefully, the tutorials in these links are still valid (I used them more than a year ago). If anything does not apply or there are better tutorials, please let the others know.

TODO

  • Request GPU quotas
  • Set up Compute Engine VM instance

Preprocess BillSum Dataset

TODO

  • Understand and compare how the CNN and arXiv/PubMed datasets are preprocessed for TransformerSum
  • Devise a concatenation scheme for the BillSum Dataset

Fix Combination and Greedy strategies

Fix the Combination and Greedy strategies so that each section has at least one sentence labeled 1 (meaning it is classified as being part of the extractive summary)
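
One way to enforce this constraint is a post-processing pass over the labels produced by either strategy, sketched below. It assumes each sentence has a section id and a per-sentence score (for example, its ROUGE against the ground truth summary); how the repo's Combination and Greedy strategies are implemented internally is not shown here, and `ensure_one_label_per_section` is a hypothetical helper.

```python
from typing import Dict, Hashable, List, Sequence, Set


def ensure_one_label_per_section(
    labels: Sequence[int], section_ids: Sequence[Hashable], scores: Sequence[float]
) -> List[int]:
    """Return a copy of labels where every section has at least one sentence labeled 1.

    For a section with no positive label, the highest-scoring sentence
    in that section is switched to 1. Existing labels are never removed.
    """
    if not (len(labels) == len(section_ids) == len(scores)):
        raise ValueError("labels, section_ids and scores must have the same length")

    fixed = list(labels)
    covered: Set[Hashable] = set()  # sections that already have a 1
    best_index: Dict[Hashable, int] = {}  # section -> index of its best sentence

    for i, (label, section) in enumerate(zip(labels, section_ids)):
        if label == 1:
            covered.add(section)
        if section not in best_index or scores[i] > scores[best_index[section]]:
            best_index[section] = i

    for section, index in best_index.items():
        if section not in covered:
            fixed[index] = 1
    return fixed
```

Running this after label selection leaves sections that already have a positive sentence untouched and adds exactly one positive label to each section that had none.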

DANCER (Part 3)

DANCER (PART 3): Create functions that would calculate the final ROUGE score of a document by aggregating (1) the predicted section summary and (2) the ground truth summary, respectively, and compare.

DUE: 11/20 Saturday 11:59pm
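
The aggregation could look roughly like this: join the per-section summaries on each side into one document-level summary, then score the prediction against the ground truth. The sketch assumes a simplified ROUGE-1 on lowercased word tokens and simple space-joined concatenation; the function names are hypothetical, and a proper ROUGE library should replace `rouge_1` for reported results.

```python
import re
from collections import Counter
from typing import Dict, List


def rouge_1(candidate: str, reference: str) -> Dict[str, float]:
    """Simplified ROUGE-1 precision, recall and F1 from unigram overlap."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def aggregate(section_summaries: List[str]) -> str:
    """Join non-empty section summaries, in order, into one summary string."""
    return " ".join(s.strip() for s in section_summaries if s.strip())


def document_rouge(
    predicted_sections: List[str], ground_truth_sections: List[str]
) -> Dict[str, float]:
    """Score the aggregated predicted summary against the aggregated ground truth."""
    return rouge_1(aggregate(predicted_sections), aggregate(ground_truth_sections))
```

Scoring the joined text rather than averaging per-section scores means long sections contribute in proportion to their length, which keeps the number comparable with a summarizer that processes the whole document at once.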
