Curation Corpus for Abstractive Text Summarisation

The Curation Corpus is a collection of 40,000 professionally-written summaries of news articles, with links to the articles themselves. This repository provides a scraper to access them. If you're interested in commercial use or access to the wider catalogue of Curation data, including a larger set of over 150,000 professionally-written abstracts and a scalable, on-demand content abstraction API (driven by humans or AI), please get in touch. For our thoughts on how we hope this release will help the NLP community, see our post introducing the dataset.

Dataset	Documents	License	Avg. summary length (words)	Avg. document length (words)	Avg. summary length (sentences)	Avg. document length (sentences)	Type
CNN	90,266	N/A	45.7	760.5	3.59	34	Implied by "summary" box
DailyMail	196,961	N/A	54.7	653.3	3.86	29.3	Implied by bullets below headline
NYT	110,540	Non-commercial	45.5	800	2.44	35.6	Abstractive summary
Xsum	276,711	N/A	23.3	431	1	19.7	Single sentence answering "what is this article about?"
Curation Base	40,000	CC-BY	82.6	527.9	4.9	27.4	Professionally written and edited standalone summary intended to be understood by itself
Curation Large	134,849*	Commercial	81.3	521	4.9	27	Professionally written and edited standalone summary intended to be understood by itself
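One way to read the table is to compare how much each dataset compresses its source documents. The sketch below divides the average summary length by the average document length (in words) for each row above. The name `compression_ratio` is ours, and a ratio of averages is only a rough indicator; it is not the same as the average per-article ratio, which the table does not report.

```python
# (avg. summary words, avg. document words), copied from the table above.
stats = {
    "CNN": (45.7, 760.5),
    "DailyMail": (54.7, 653.3),
    "NYT": (45.5, 800.0),
    "Xsum": (23.3, 431.0),
    "Curation Base": (82.6, 527.9),
    "Curation Large": (81.3, 521.0),
}


def compression_ratio(summary_words, document_words):
    """Return summary length as a fraction of document length."""
    return summary_words / document_words


ratios = {name: compression_ratio(s, d) for name, (s, d) in stats.items()}

# Print datasets from most to least compressed.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {ratio:.1%}")
```

On these averages the Curation summaries come to roughly 16% of the document length, against roughly 5 to 8% for the other datasets.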

Instructions

Clone this repository (or just copy the code from web_scraper.py)

Download the URLs, headlines, and summaries (curation-corpus-base.csv, fetched by the wget command below)

Run web_scraper.py. Pass three command line arguments: the path to the CSV file without article text, the path to a new CSV file that will contain the article text, and a batch size that determines how many URLs are scraped at a time. Larger batch sizes make it run faster, but more articles may be dropped due to timeouts. I recommend ~50 on a 2015 MacBook Pro.

git clone https://github.com/CurationCorp/curation-corpus.git
cd curation-corpus
wget https://curation-datasets.s3-eu-west-1.amazonaws.com/curation-corpus-base.csv
python web_scraper.py curation-corpus-base.csv curation-corpus-base-with-articles.csv 50
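To illustrate the batch-size trade-off described above, here is a minimal sketch of batched fetching. It is not the code in web_scraper.py: the names `batches`, `scrape_in_batches`, and the `fetch` callable are hypothetical, and the sketch simply assumes that each batch is fetched concurrently and that a failed or timed-out request yields no text.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def scrape_in_batches(urls, fetch, batch_size):
    """Fetch every URL one batch at a time and return {url: text or None}.

    `fetch` is any callable that takes a URL and returns article text.
    It is a placeholder for illustration, not a function from web_scraper.py.
    """
    def safe_fetch(url):
        # A timeout or any other failure drops the article instead of
        # aborting the whole run.
        try:
            return fetch(url)
        except Exception:
            return None

    results = {}
    for batch in batches(urls, batch_size):
        # One worker per URL: a bigger batch means more simultaneous requests.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            for url, text in zip(batch, pool.map(safe_fetch, batch)):
                results[url] = text
    return results
```

In a setup like this, a larger batch finishes sooner because more requests are in flight at once, but it also puts more load on the network connection, which is one plausible reason timeouts become more frequent.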

Some URLs will return messy results due to content changing over time, paywalls, etc. We've tried to remove the worst offenders from this release, though there is probably still scope for improving the scraper.
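If you want to screen out the messiest rows before training, a simple length filter is one place to start. The sketch below is an assumption-laden example, not part of this repository: the column name `article_content` and the 50-word threshold are guesses, so check the header of your output CSV and adjust both to suit your data.

```python
import csv


def keep_usable_rows(rows, text_column="article_content", min_words=50):
    """Return the rows whose article text has at least `min_words` words.

    `text_column` is an assumed column name; rows that lack it, or whose
    text is empty, are dropped.
    """
    kept = []
    for row in rows:
        text = (row.get(text_column) or "").strip()
        if len(text.split()) >= min_words:
            kept.append(row)
    return kept


def filter_csv(in_path, out_path, text_column="article_content", min_words=50):
    """Copy `in_path` to `out_path`, keeping only usable rows.

    Returns the number of rows written.
    """
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        kept = keep_usable_rows(reader, text_column, min_words)
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)
```

For example, `filter_csv("curation-corpus-base-with-articles.csv", "curation-corpus-base-filtered.csv")` would write a filtered copy; the output file name here is only illustrative. A word-count threshold will not catch every paywall notice or cookie banner, so spot-check the result.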

Tutorials

We are still learning about this field ourselves and will share our tutorials in the examples folder. If you use our dataset in your own research, write a tutorial, or have anything you would like to share, let us know and we will link to it from here!

About Curation

Curation is a SaaS business combining machine learning & human intelligence enabling executives to effortlessly follow emerging risks, themes and client activity with a particular focus on ESG-related issues. We enable businesses to act faster, delivering significant time and cost savings.

Citation

@misc{curationcorpusbase:2020,
  title={Curation Corpus Base},
  author={Curation Corporation},
  year={2020}
}

License

Please remember to attribute any derivative works in accordance with the terms of the CC-BY license.

This work is licensed under a Creative Commons Attribution 4.0 International License.
