
muld's Introduction

MuLD: The Multitask Long Document Benchmark

MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks in which every input consists of at least 10,000 words. The benchmark covers a wide variety of task types, including translation, summarization, question answering, and classification. Output lengths also vary widely, from a single-word classification label to an output longer than the input text.
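As a rough illustration of the 10,000-word threshold, the sketch below checks whether a document would count as "long" under that definition. It assumes a word is any whitespace-separated token, which may differ from how the paper counts words; the function names `count_words` and `is_long_document` are hypothetical and not part of this repo.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed, simple definition of a word)."""
    return len(text.split())


def is_long_document(text: str, min_words: int = 10_000) -> bool:
    """Return True if the text has at least `min_words` whitespace-separated tokens."""
    return count_words(text) >= min_words


# A synthetic document of exactly 10,000 tokens sits right on the threshold.
sample = "word " * 10_000
print(count_words(sample), is_long_document(sample))
```

A check like this can be useful when mixing MuLD with other data and filtering for documents of comparable length.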


This repo contains official code for the paper MuLD: The Multitask Long Document Benchmark.

Quickstart

The easiest method is to use the Huggingface Datasets library:

import datasets

# Each call loads one MuLD task; the second argument is the task's config name.
# Keep only the line(s) for the task(s) you need, since each call reassigns ds.
ds = datasets.load_dataset("ghomasHudson/muld", "NarrativeQA")
ds = datasets.load_dataset("ghomasHudson/muld", "HotpotQA")
ds = datasets.load_dataset("ghomasHudson/muld", "Character Archetype Classification")
ds = datasets.load_dataset("ghomasHudson/muld", "OpenSubtitles")
ds = datasets.load_dataset("ghomasHudson/muld", "AO3 Style Change Detection")
ds = datasets.load_dataset("ghomasHudson/muld", "VLSP")
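To keep all six tasks available at once instead of overwriting a single variable, one option is to collect them in a dictionary keyed by config name. This is a minimal sketch, not part of the official code: the helper `load_all_muld` and the constant `MULD_CONFIGS` are hypothetical names, and `streaming=True` is an assumed choice to avoid downloading every task up front. Behaviour may differ between versions of the `datasets` library.

```python
# Config names as they appear in the Quickstart calls.
MULD_CONFIGS = [
    "NarrativeQA",
    "HotpotQA",
    "Character Archetype Classification",
    "OpenSubtitles",
    "AO3 Style Change Detection",
    "VLSP",
]


def load_all_muld(repo_id: str = "ghomasHudson/muld", streaming: bool = True) -> dict:
    """Load every MuLD task and return a dict mapping config name to dataset.

    The `datasets` import is done lazily so that MULD_CONFIGS can be used
    without the library installed.
    """
    import datasets

    return {
        name: datasets.load_dataset(repo_id, name, streaming=streaming)
        for name in MULD_CONFIGS
    }
```

Calling `tasks = load_all_muld()` and then `tasks["NarrativeQA"]` would give the same object as the corresponding single `load_dataset` call, loaded in streaming mode.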

Or by cloning this repo:

import datasets
ds = datasets.load_dataset("./muld.py", "NarrativeQA")
...

Manual Download

If you prefer to download the data files yourself:

Citation

If you use our benchmark please cite the paper:

@InProceedings{hudson-almoubayed:2022:LREC,
  author    = {Hudson, George  and  Al Moubayed, Noura},
  title     = {MuLD: The Multitask Long Document Benchmark},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {3675--3685},
  url       = {https://aclanthology.org/2022.lrec-1.392}
}

Additionally, please cite the datasets we used (particularly NarrativeQA, HotpotQA, and OpenSubtitles, whose data we use directly with limited filtering).

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property        value
name            MuLD
alternateName   Multitask Long Document Benchmark
url
description     MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks in which every input consists of at least 10,000 words. The benchmark covers a wide variety of task types, including translation, summarization, question answering, and classification. Output lengths also vary widely, from a single-word classification label to an output longer than the input text.
citation        https://arxiv.org/abs/2202.07362
creator         property   value
                name       Thomas Hudson
                sameAs     https://orcid.org/0000-0003-3562-3593


muld's Issues

Baseline Models

Hi, thank you for this nice dataset ! :-)

In your LREC22 paper, you mention that you "[...] provide the data, baseline models, and other code [...]" in this repo. I can currently only find the data -- are you planning to provide the baseline models and other code, too?

License

What license are you using for your dataset? This is the default for github repos without licenses: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository

You're under no obligation to choose a license. However, without a license, the default copyright laws apply, meaning that you retain all rights to your source code and no one may reproduce, distribute, or create derivative works from your work. If you're creating an open source project, we strongly encourage you to include an open source license. The Open Source Guide provides additional guidance on choosing the correct license for your project.

Note: If you publish your source code in a public repository on GitHub, according to the Terms of Service, other users of GitHub.com have the right to view and fork your repository. If you have already created a repository and no longer want users to have access to the repository, you can make the repository private. When you change the visibility of a repository to private, existing forks or local copies created by other users will still exist. For more information, see "Setting repository visibility."
