
WikiAsp: A Dataset for Multi-domain Aspect-based Summarization

This repository contains the dataset from the paper "WikiAsp: A Dataset for Multi-domain Aspect-based Summarization".

WikiAsp is a multi-domain, aspect-based summarization dataset in the encyclopedic domain. In this task, models are asked to summarize the cited reference documents of a Wikipedia article into aspect-based summaries. Each of the 20 domains includes 10 domain-specific, pre-defined aspects.


Dataset

Download

WikiAsp is available via 20 zipped archives, each of which corresponds to one domain. More than 28GB of storage space is necessary to download and store all the domains (unzipped). The following command downloads all of them and extracts the archives:

./scripts/download_and_extract_all.sh /path/to/save_directory

Alternatively, one can download an archive for each domain individually from the table below. (Note: left-clicking will not open a download dialog. Open the link in a new tab, save it via your browser's context menu, or use wget.)

Domain                  Link      Size (unzipped)
----------------------  --------  ---------------
Album                   Download  2.3GB
Animal                  Download  589MB
Artist                  Download  2.2GB
Building                Download  1.3GB
Company                 Download  1.9GB
EducationalInstitution  Download  1.9GB
Event                   Download  900MB
Film                    Download  2.8GB
Group                   Download  1.2GB
HistoricPlace           Download  303MB
Infrastructure          Download  1.3GB
MeanOfTransportation    Download  792MB
OfficeHolder            Download  2.0GB
Plant                   Download  286MB
Single                  Download  1.5GB
SoccerPlayer            Download  721MB
Software                Download  1.3GB
TelevisionShow          Download  1.1GB
Town                    Download  932MB
WrittenWork             Download  1.8GB
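
For scripted downloads, the following is a minimal Python sketch. The archive_url placeholder must be replaced with the address behind one of the Download links above, and the bzip2-tarball format is an assumption (the issues below refer to *.bz2 archives):

import tarfile
import urllib.request

# Placeholder: copy the real address behind a "Download" link in the table above.
archive_url = "<archive-url>"

# Fetch the archive to a temporary file, then extract it under data/.
fname, _ = urllib.request.urlretrieve(archive_url)
with tarfile.open(fname, mode="r:bz2") as tar:  # assumption: archives are bzip2 tarballs
    tar.extractall("data")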

Format

Each domain includes three files, {train,valid,test}.jsonl, where each line represents one instance in JSON format. Each instance has the following structure:

{
    "exid": "train-1-1",
    "input": [  
        "tokenized and uncased sentence_1 from document_1",
        "tokenized and uncased sentence_2 from document_1",
        "...",
        "tokenized and uncased sentence_i from document_j",
        "..."
    ],
    "targets": [ 
        ["a_1", "tokenized and uncased aspect-based summary for a_1"],
        ["a_2", "tokenized and uncased aspect-based summary for a_2"],
        "..."
    ]
}

where:

  • exid: str
  • input: List[str]
  • targets: List[Tuple[str,str]]

Here, input holds the cited references as a list of sentences tokenized with NLTK. The targets key points to a list of aspect-based summaries, where each element is a pair of a) the target aspect and b) the corresponding aspect-based summary.

Inheriting from the base corpus, this dataset exhibits the following characteristics:

  • Cited references are composed of multiple documents, but document boundaries are not preserved, so each instance's input is expressed simply as a list of sentences.
  • Sentences in the cited references (input) are tokenized using NLTK.
  • The number of target summaries for each instance varies.
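
As a minimal loading sketch, each split can be read line by line with the Python standard library (the path Album/train.jsonl is hypothetical; point it at an extracted domain):

import json

def read_split(path):
    """Yield WikiAsp instances, one per JSONL line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Inspect the first training instance of a domain.
for instance in read_split("Album/train.jsonl"):
    print(instance["exid"])
    print(len(instance["input"]), "reference sentences")
    for aspect, summary in instance["targets"]:
        print(aspect, "->", summary[:80])
    break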

Citation

If you use the dataset, please consider citing:

@article{hayashi20tacl,
    title = {WikiAsp: A Dataset for Multi-domain Aspect-based Summarization},
    author = {Hiroaki Hayashi and Prashant Budania and Peng Wang and Chris Ackerson and Raj Neervannan and Graham Neubig},
    journal = {Transactions of the Association for Computational Linguistics (TACL)},
    url = {https://arxiv.org/abs/2011.07832},
    year = {2020}
}

LICENSE

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Issues

Mapping to Wikipedia article title

Hello,
Thank you for open sourcing this dataset.

  1. I wonder if there is any metadata to map the data instances to Wikipedia article titles?

  2. Also, when following the WikiSum process, did you use only the CommonCrawl part of the data and not the live web?

Thanks,

js code in input & empty outputs

I've gone through some of the data in different domains. A huge portion of the inputs contain JavaScript code and URLs crawled from the websites, and a large portion of the outputs (target summaries) are simply empty. Could you please provide a cleaned or filtered version of the data?
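
(For reference, empty target summaries can be dropped on the client side; the following is a minimal sketch assuming the JSONL format described above, not an official cleaning script:)

import json

def non_empty_instances(path):
    """Yield instances with empty aspect summaries removed; skip
    instances whose targets are all empty."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            inst = json.loads(line)
            kept = [(a, s) for a, s in inst["targets"] if s.strip()]
            if kept:
                inst["targets"] = kept
                yield inst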

Request: add dataset as Git LFS

Hello, I am trying to add your dataset to the Hugging Face datasets library (https://github.com/huggingface/datasets).
Currently, we are unable to download the files directly from releases because of the way GitHub manages (i.e., stores) the assets.

Would it be possible for you to instead add the files as LFS (https://git-lfs.github.com/)?

git lfs install                # one-time setup, if LFS is not yet initialized
git lfs track "*.bz2"
mkdir data
# copy the .bz2 archives into the data directory
git add .gitattributes data
git commit -m "Add dataset archives via Git LFS"

I would have submitted a pull request, but GitHub does not allow adding LFS files to forks 😯

Request for metadata

Dear Author,

Thank you for the dataset!

I would like to know whether the aspects in "targets" are listed in the order in which they appear in the Wikipedia article.
If not, is there some form of metadata indicating the corresponding Wikipedia article link for the "targets"?
Do you also have the URLs used for the "inputs" of each data point?

Thank you!
