
WikiAsp: A Dataset for Multi-domain Aspect-based Summarization

This repository contains the dataset from the paper "WikiAsp: A Dataset for Multi-domain Aspect-based Summarization".

WikiAsp is a multi-domain, aspect-based summarization dataset in the encyclopedic domain. In this task, models are asked to summarize the cited reference documents of a Wikipedia article into aspect-based summaries. Each of the 20 domains includes 10 domain-specific, pre-defined aspects.


Dataset

Download

WikiAsp is available via 20 zipped archives, each of which corresponds to one domain. More than 28GB of storage space is necessary to download and store all the domains (unzipped). The following command downloads all of them and extracts the archives:

./scripts/download_and_extract_all.sh /path/to/save_directory

Alternatively, one can download an archive for each domain individually from the table below. (Note: left-clicking will not open a download dialog. Open the link in a new tab, save it via your browser's context menu, or use wget.)

Domain                  Link      Size (unzipped)
----------------------  --------  ---------------
Album                   Download  2.3GB
Animal                  Download  589MB
Artist                  Download  2.2GB
Building                Download  1.3GB
Company                 Download  1.9GB
EducationalInstitution  Download  1.9GB
Event                   Download  900MB
Film                    Download  2.8GB
Group                   Download  1.2GB
HistoricPlace           Download  303MB
Infrastructure          Download  1.3GB
MeanOfTransportation    Download  792MB
OfficeHolder            Download  2.0GB
Plant                   Download  286MB
Single                  Download  1.5GB
SoccerPlayer            Download  721MB
Software                Download  1.3GB
TelevisionShow          Download  1.1GB
Town                    Download  932MB
WrittenWork             Download  1.8GB
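
For scripted downloads, the following is a minimal Python sketch. The archive_url placeholder must be replaced with the address behind one of the Download links above, and the bzip2-tarball format is an assumption (the issues below refer to *.bz2 archives):

import tarfile
import urllib.request

# Placeholder: copy the real address behind a "Download" link in the table above.
archive_url = "<archive-url>"

# Fetch the archive to a temporary file, then extract it under data/.
fname, _ = urllib.request.urlretrieve(archive_url)
with tarfile.open(fname, mode="r:bz2") as tar:  # assumption: archives are bzip2 tarballs
    tar.extractall("data")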

Format

Each domain includes three files, {train,valid,test}.jsonl, where each line represents one instance in JSON format. Each instance has the following structure:

{
    "exid": "train-1-1",
    "input": [  
        "tokenized and uncased sentence_1 from document_1",
        "tokenized and uncased sentence_2 from document_1",
        "...",
        "tokenized and uncased sentence_i from document_j",
        "..."
    ],
    "targets": [ 
        ["a_1", "tokenized and uncased aspect-based summary for a_1"],
        ["a_2", "tokenized and uncased aspect-based summary for a_2"],
        "..."
    ]
}

where:

  • exid: str
  • input: List[str]
  • targets: List[Tuple[str,str]]

Here, input holds the cited references as a list of sentences tokenized with NLTK. The targets key points to a list of aspect-based summaries, where each element is a pair of a) the target aspect and b) the corresponding aspect-based summary.

Inheriting from the base corpus, this dataset exhibits the following characteristics:

  • Cited references are composed of multiple documents, but document boundaries are not preserved, so each instance's input is expressed simply as a list of sentences.
  • Sentences in the cited references (input) are tokenized using NLTK.
  • The number of target summaries for each instance varies.
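
As a minimal loading sketch, each split can be read line by line with the Python standard library (the path Album/train.jsonl is hypothetical; point it at an extracted domain):

import json

def read_split(path):
    """Yield WikiAsp instances, one per JSONL line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Inspect the first training instance of a domain.
for instance in read_split("Album/train.jsonl"):
    print(instance["exid"])
    print(len(instance["input"]), "reference sentences")
    for aspect, summary in instance["targets"]:
        print(aspect, "->", summary[:80])
    break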

Citation

If you use the dataset, please consider citing:

@article{hayashi20tacl,
    title = {WikiAsp: A Dataset for Multi-domain Aspect-based Summarization},
    author = {Hiroaki Hayashi and Prashant Budania and Peng Wang and Chris Ackerson and Raj Neervannan and Graham Neubig},
    journal = {Transactions of the Association for Computational Linguistics (TACL)},
    url = {https://arxiv.org/abs/2011.07832},
    year = {2020}
}

LICENSE

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Issues

Mapping to Wikipedia article title

Hello,
Thank you for open sourcing this dataset.

  1. I wonder if there is any metadata to map the data instances to Wikipedia article titles?

  2. Also, when following the WikiSum process, did you use only the CommonCrawl part of the data and not the live web?

Thanks,

js code in input & empty outputs

I've gone through some of the data in different domains. A huge portion of the inputs contain JavaScript code and URLs crawled from the websites, and a large portion of the outputs (target summaries) are simply empty. Could you please provide a cleaned or filtered version of the data?
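
(For reference, empty target summaries can be dropped on the client side; the following is a minimal sketch assuming the JSONL format described above, not an official cleaning script:)

import json

def non_empty_instances(path):
    """Yield instances with empty aspect summaries removed; skip
    instances whose targets are all empty."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            inst = json.loads(line)
            kept = [(a, s) for a, s in inst["targets"] if s.strip()]
            if kept:
                inst["targets"] = kept
                yield inst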

Request: add dataset as Git LFS

Hello, I am trying to add your dataset to the Hugging Face datasets library (https://github.com/huggingface/datasets).
Currently, we are unable to download the files directly from releases because of the way GitHub manages (i.e., stores) the assets.

Would it be possible for you to instead add the files as LFS (https://git-lfs.github.com/)?

git lfs install                # one-time setup, if LFS is not yet initialized
git lfs track "*.bz2"
mkdir data
# copy the .bz2 archives into the data directory
git add .gitattributes data
git commit -m "Add dataset archives via Git LFS"

I would have submitted a pull request, but GitHub does not allow adding LFS files to forks 😯

Request for metadata

Dear Author,

Thank you for the dataset!

I would like to know whether the aspects in "targets" are listed in the order in which they appear in the Wikipedia article.
If not, is there some form of metadata indicating the corresponding Wikipedia article link for the "targets"?
Do you also have the URLs used for the "inputs" of each data point?

Thank you!
