
newsroom's Introduction

Installation Instructions

Newsroom requires Python 3 and can be installed using pip:

pip install -e git+https://github.com/clic-lab/newsroom.git#egg=newsroom

Data Processing Tools

Newsroom contains two scripts: one downloads articles from Archive.org and the other processes the downloaded data. First, download the "thin" data from summari.es:

wget https://summari.es/files/thin.tar
tar xvf thin.tar

Both the newsroom-scrape and newsroom-extract tools described below print their argument help when run with the --help command-line option.
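
Before scraping, it can help to peek at a thin file. The sketch below uses only the Python standard library and assumes nothing about the record fields beyond each non-empty line being one JSON object. The helper name peek_jsonl_gz is hypothetical and is not part of the Newsroom package.

```python
import gzip
import json
import os


def peek_jsonl_gz(path, limit=3):
    """Return up to `limit` records from a gzip-compressed JSON lines file."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            if len(records) >= limit:
                break
            if not line.strip():
                continue  # skip blank lines
            records.append(json.loads(line))
    return records


# Only runs if the thin data has already been unpacked in this directory.
if os.path.exists("thin/dev.jsonl.gz"):
    for record in peek_jsonl_gz("thin/dev.jsonl.gz"):
        # Print the field names rather than assuming what they are.
        print(sorted(record))
```

Printing the sorted keys shows which fields each thin record carries without relying on any particular schema.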

Data Scraping

The thin directory will contain three files: train.jsonl.gz, dev.jsonl.gz and test.jsonl.gz. To begin downloading the development set from Archive.org, run the following:

newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive

Estimated download time is indicated with a progress bar. If errors occur during downloading, you may need to re-run the script later to capture the missing articles. This process is network bound and depends mostly on Archive.org, so save your CPU cycles for the extraction stage!

The downloading process can be stopped at any time with Control-C and resumed later. It is also possible to perform extraction of a partially downloaded dataset with newsroom-extract before continuing to download the full version.
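
Because scraping can be re-run to pick up missing articles, the whole download can be driven from a small script. This is a minimal sketch, not part of the package: the function names and the number of passes are hypothetical, and the train.archive and test.archive output names are an assumption that follows the dev.archive pattern used above.

```python
import subprocess

SPLITS = ("train", "dev", "test")


def scrape_command(split, thin_dir="thin"):
    """Build the newsroom-scrape command line for one split."""
    return [
        "newsroom-scrape",
        "--thin", f"{thin_dir}/{split}.jsonl.gz",
        "--archive", f"{split}.archive",  # assumed naming, mirrors dev.archive
    ]


def scrape_all(splits=SPLITS, passes=2):
    """Run the scraper over every split, repeating to retry missed articles."""
    for _ in range(passes):
        for split in splits:
            # check=False: a failed pass is simply retried on the next pass.
            subprocess.run(scrape_command(split), check=False)
```

Calling scrape_all() requires newsroom-scrape to be installed and on the PATH; otherwise subprocess.run raises FileNotFoundError.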

Data Extraction

The newsroom-extract tool extracts summaries and article text from the data downloaded by newsroom-scrape. It writes a new file and leaves the original newsroom-scrape output unmodified. Run it with:

newsroom-extract --archive dev.archive --dataset dev.data

The script automatically parallelizes extraction across your CPU cores. To disable this or reduce the number of cores used, use the --workers option. Like scraping, the extraction process can be stopped at any point with Control-C and resumed later.
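
The following sketch builds an extraction command with a capped worker count. It assumes that --workers takes a positive integer, which the text above implies but does not state; check newsroom-extract --help before relying on it. The helper names are hypothetical.

```python
import os


def extract_command(archive, dataset, workers=None):
    """Build the newsroom-extract command line, optionally limiting workers."""
    command = ["newsroom-extract", "--archive", archive, "--dataset", dataset]
    if workers is not None:
        # Assumption: --workers accepts a positive integer count.
        command += ["--workers", str(max(1, int(workers)))]
    return command


def half_the_cores():
    """Return half of the available CPU cores, but never fewer than one."""
    return max(1, (os.cpu_count() or 1) // 2)


print(" ".join(extract_command("dev.archive", "dev.data", half_the_cores())))
```

Leaving workers as None omits the option entirely, which keeps the tool's default automatic parallelization.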

Reading and Analyzing the Data

All data are represented using gzip-compressed JSON lines. The Newsroom package provides an easy tool to read and write these files, and to do so up to 20x faster than the standard Python gzip and json packages!

from newsroom import jsonl

# Read entire file:

with jsonl.open("train.data", gzip = True) as train_file:
    train = train_file.read()

# Read file entry by entry:

with jsonl.open("train.data", gzip = True) as train_file:
    for entry in train_file:
        print(entry["summary"], entry["text"])
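
Once entries are loaded, simple corpus statistics follow directly from the summary and text fields used above. This is a minimal sketch with a hypothetical helper name; it counts whitespace-separated words and, as a precaution, skips entries where either field is missing or empty.

```python
from statistics import mean


def length_stats(entries):
    """Return the entry count and mean word counts for summaries and texts."""
    summary_lengths = []
    text_lengths = []
    for entry in entries:
        summary, text = entry.get("summary"), entry.get("text")
        if not summary or not text:
            continue  # skip incomplete entries
        summary_lengths.append(len(summary.split()))
        text_lengths.append(len(text.split()))
    if not summary_lengths:
        return {"count": 0, "mean_summary_words": 0.0, "mean_text_words": 0.0}
    return {
        "count": len(summary_lengths),
        "mean_summary_words": mean(summary_lengths),
        "mean_text_words": mean(text_lengths),
    }
```

The function accepts any iterable of dictionaries, so it works with a list returned by read() or with a file iterated entry by entry.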

Extraction Analysis

The Newsroom package also contains scripts for identifying extractive fragments and computing metrics described in the paper: coverage, density, and compression.

import random

from newsroom import jsonl
from newsroom.analyze import Fragments

with jsonl.open("train.data", gzip = True) as train_file:
    train = train_file.read()

# Compute stats on random training example:

entry = random.choice(train)
summary, text = entry["summary"], entry["text"]
fragments = Fragments(summary, text)

# Print paper metrics:

print("Coverage:",    fragments.coverage())
print("Density:",     fragments.density())
print("Compression:", fragments.compression())

# Extractive fragments oracle:

print("List of extractive fragments:")
print(fragments.strings())
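
To make the three metrics concrete, here is a simplified, standalone sketch of greedy fragment matching. It is not the package's implementation: it assumes whitespace tokens, exact token matching, and a non-empty summary, and the Fragments class may tokenize and normalize differently. Coverage is the fraction of summary tokens that fall inside an extractive fragment, density is the sum of squared fragment lengths divided by the summary length, and compression is the ratio of article length to summary length.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest shared token spans, left to right."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        # Find the longest article span that matches the summary at position i.
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1  # this summary token does not appear in the article
    return fragments


def coverage(fragments, summary_tokens):
    return sum(len(f) for f in fragments) / len(summary_tokens)


def density(fragments, summary_tokens):
    return sum(len(f) ** 2 for f in fragments) / len(summary_tokens)


def compression(article_tokens, summary_tokens):
    return len(article_tokens) / len(summary_tokens)


article = "the cat sat on the mat near the door".split()
summary = "the cat sat by the door".split()
found = extractive_fragments(article, summary)

print(found)  # [['the', 'cat', 'sat'], ['the', 'door']]
print("Coverage:", coverage(found, summary))
print("Density:", density(found, summary))
print("Compression:", compression(article, summary))  # 1.5
```

In this toy example five of the six summary tokens are copied from the article, so coverage is 5/6, and density is (9 + 4) / 6 because the two fragments have lengths 3 and 2.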

Evaluation Tools

Available soon!
