Newsjam

This repository contains the code, data and results for the paper Newsjam: A Multilingual Summarization Tool for News Articles by Joseph Keenan, Shane Kaszefski-Yaschuk, Adelia Khasanova, and Maxime Méloux, presented at eKNOW 2022 and published in the ThinkMind Digital Library.

Repository structure:

  • Annotation_Stats.ipynb contains the IAA computation module.
  • Pipeline.ipynb contains the full bot pipeline implementation, from scraping to posting. Running it requires a Twitter API key, which is not included in this repository for security reasons.
  • main.ipynb contains the main summarization module. It can instantiate specific summarization and evaluation submodules, as well as save generated summaries to an output file.
  • classif\ contains data, scripts and notebooks related to the classification subtask:
    • \Annotation Guidelines.docx contains the annotation guidelines
    • \csv_lest_republicain_summ.csv contains the L'Est Républicain corpus and its manual tags
    • \log_reg_classifier.ipynb, \doc_classification_logistic.ipynb and \naive_bayes.ipynb implement various classification methods
    • \log_reg_classifier.py is the final classifier used in the pipeline
  • data\ contains data, scripts and notebooks related to scraping. In particular:
    • \est_republicain.ipynb contains the scraping functions for L'Est Républicain
    • \est_republicain.json contains the JSON-formatted list of articles extracted from L'Est Républicain
    • \scraper_functions.ipynb contains the scraping functions for Actu
    • \scraper_functions.py contains the final scraper for Actu that is used in the pipeline
    • \actu_articles.json contains the JSON-formatted list of articles extracted from Actu
  • deliver\ contains all reports, slides and posters that were delivered during the project's lifetime
  • eval\ contains all the modules implementing evaluation metrics:
    • \bert_eval.py contains the implementation of BERTScore
    • \eval.py is a helper file containing generic evaluation functions
    • \rouge_l.py contains the implementation of the ROUGE-L score
    • \time.py contains the implementation of the running time measurement
  • gen\ contains the summaries generated by all three summarizers when run on our own corpus, in full and keyword-only (kw) versions
  • summ\ contains all the modules implementing summarization methods:
    • \bert_embed.py contains the implementation of summarization using BERT-like models and K-means clustering
    • \lsa.py contains the implementation of summarization using Latent Semantic Analysis
    • \lsa.ipynb contains a notebook version of the previous implementation, written to be more readable and interactive so that the code can be run step by step and the output of each section inspected
    • \text_processing.py contains the text pre- and post-processing functions
    • \sum_transformers.ipynb contains an alternate implementation of summarization using BERT-like models
    • \utils.py contains various utility functions for summarization
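
As an illustration of the kind of metric the eval\ modules implement, here is a minimal sketch of ROUGE-L, which scores a candidate summary by the longest common subsequence (LCS) it shares with a reference summary. This is not the code in rouge_l.py: the function names `lcs_length` and `rouge_l`, the whitespace tokenization, the lowercasing, and the `beta` parameter are all assumptions made for this example.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> Dict[str, float]:
    """ROUGE-L precision, recall and F-score (hypothetical helper).

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    zeros = {"precision": 0.0, "recall": 0.0, "f": 0.0}
    if not cand or not ref:
        return zeros
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return zeros
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f = (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
    return {"precision": precision, "recall": recall, "f": f}


if __name__ == "__main__":
    print(rouge_l("the cat sat on the mat", "the cat is on the mat"))
```

With `beta = 1.0` the F-score is the harmonic mean of precision and recall; larger values of `beta` weight recall more heavily.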

Reference

@inproceedings{keenan-et-al-2022-newsjam,
title={Newsjam: A Multilingual Summarization Tool for News Articles},
author={Keenan, Joseph and Kaszefski-Yaschuk, Shane and Khasanova, Adelia and Méloux, Maxime},
ISBN={9781612089867},
url={https://www.thinkmind.org/index.php?view=article&articleid=eknow_2022_3_10_60008},
year={2022},
month={06},
pages={55--61}}

newsjam's Issues

Bypassing the Twitter character limit

Problem: some good summaries are longer than 280 characters. The pipeline then selects one that fits a tweet even though it may be of lower quality.

Idea: generate an image containing the summary and post that image on Twitter.

This bypasses Twitter's character limit.
The actual text of the tweet could be hashtags of keywords, or links.
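
The decision described in this issue could be sketched as follows. This is a hypothetical outline, not code from the pipeline: `plan_post`, its parameters, and the returned dictionary layout are assumptions, and the actual image rendering is left out (the sketch only wraps the summary into lines that a drawing library could place on an image).

```python
import textwrap
from typing import Dict, List, Tuple

TWEET_LIMIT = 280  # character limit mentioned in the issue


def plan_post(
    scored_summaries: List[Tuple[str, float]],
    keywords: List[str],
    limit: int = TWEET_LIMIT,
    wrap_width: int = 60,
) -> Dict[str, object]:
    """Pick the best-scoring summary and decide how to post it (hypothetical).

    Assumptions: each summary comes with a quality score (higher is better),
    the list is non-empty, and len() is an acceptable stand-in for Twitter's
    own character counting.
    """
    best_text, _ = max(scored_summaries, key=lambda pair: pair[1])
    if len(best_text) <= limit:
        # The best summary fits in a tweet: post it as plain text.
        return {"mode": "text", "tweet": best_text, "image_lines": []}
    # Otherwise keep the best summary and move it into an image;
    # the tweet text itself becomes hashtags built from the keywords.
    hashtags = " ".join("#" + kw.replace(" ", "") for kw in keywords)
    return {
        "mode": "image",
        "tweet": hashtags[:limit],  # naive truncation, may cut a hashtag
        "image_lines": textwrap.wrap(best_text, width=wrap_width),
    }


if __name__ == "__main__":
    long_summary = " ".join(["word"] * 100)  # 499 characters, too long for a tweet
    plan = plan_post([("Short summary.", 0.4), (long_summary, 0.9)], ["news", "Nancy"])
    print(plan["mode"], plan["tweet"], len(plan["image_lines"]))
```

The point of the design is that quality drives the selection and length only drives the posting format, instead of length filtering out the best summary.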

