EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain

Dennis Aumiller*, Ashish Chouhan*, and Michael Gertz
Heidelberg University & SRH Hochschule Heidelberg
Contact us at: {aumiller, chouhan, gertz}@informatik.uni-heidelberg.de

Find our dataset on the Hugging Face Hub: 🤗 eur-lex-sum
The data card provides further insight into the acquisition process (and some limitations) of the data. Please refer to the Hugging Face Hub for more information.
A pre-print of our work is available. The paper has also been accepted at the main conference track of EMNLP 2022; the conference proceedings will be available in December 2022.

Installation

Install all necessary dependencies by running

python3 -m pip install -r requirements.txt

after cloning this repository.

This code base provides the scripts needed for the scraping process (Scraping/), the analysis of our corpus (Analysis/), and the final baseline experiments (Baselines/).

Comparison to Related Work

For a comparison of language-specific statistics, please refer to Table 5 in our pre-print.

| Dataset Name | Domain | Number of Languages | Average Tokens in Reference Text | Average Tokens in Summary Text (in words) | Compression Ratio | Dataset |
|---|---|---|---|---|---|---|
| EUR-Lex-Sum (our contribution) | Legal | 24 | 12,200 (EN) | 799 (EN) | 16 | 🤗 |
| BillSum (US) | Legal | 1 | 1382 | 2000 characters (words are not considered as tokens) | - | 🤗 |
| BillSum (CA) | Legal | 1 | 1684 | 2000 characters (words are not considered as tokens) | - | 🤗 |
| Global Voices | News | 15 | 359 | 51 | - | Paperswithcode |
| WikiLingua | WikiHow | 18 | 391 | 39 | - | 🤗 |
| Xwikis (comparable) | Wikipedia | 4 | 945 | 77 | EN: ~12.2 | 🤗 |
| Xwikis (parallel) | Wikipedia | 4 | 972 | 76 | 18.35 | 🤗 |
| Spektrum (Wiki) | Wikipedia | 2 | 1559 | 140 | 20 | GitHub |
| Spektrum (Spektrum) | Scientific | 2 | 2337 | 361 | 30 | GitHub |
| CLIDSUM (Chat) | Dialogue | 3 | 83.9 | 20.3 | - | GitHub |
| CLIDSUM (Interview) | Dialogue | 3 | 1555.4 | 14.4 | - | GitHub |
| MLSUM | News | 5 | FR (French): 632.39 | FR: 29.5 | FR: 21.4 | 🤗 |
| | | | DE (German): 570.6 | DE: 30.36 | DE: 18.8 | |
| | | | ES (Spanish): 800.50 | ES: 20.71 | ES: 38.7 | |
| | | | RU (Russian): 959.4 | RU: 14.57 | RU: 65.8 | |
| | | | TU (Turkish): 309.18 | TU: 22.88 | TU: 13.5 | |
| | | | EN (English): 790.24 | EN: 55.56 | EN: 14.2 | |
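
The Compression Ratio column relates the length of a reference text to the length of its summary. This README does not state how the ratio is aggregated over a corpus, so the following is only a minimal sketch under stated assumptions: tokens are approximated by whitespace splitting, and the function names (`token_count`, `compression_ratio`, `corpus_compression_ratios`) and the toy data are hypothetical and not taken from this repository. It illustrates that the mean of per-document ratios can differ from the ratio of mean lengths, which is one possible reason why a reported ratio need not equal the quotient of the two average-length columns (for EUR-Lex-Sum, 12,200 / 799 ≈ 15.3).

```python
from statistics import mean


def token_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed, simplified tokenizer)."""
    return len(text.split())


def compression_ratio(reference: str, summary: str) -> float:
    """Reference length divided by summary length, both in tokens."""
    n_summary = token_count(summary)
    if n_summary == 0:
        raise ValueError("summary must contain at least one token")
    return token_count(reference) / n_summary


def corpus_compression_ratios(pairs):
    """Return (mean of per-document ratios, ratio of mean lengths)."""
    pairs = list(pairs)
    per_document = mean(compression_ratio(ref, summ) for ref, summ in pairs)
    ratio_of_means = (
        mean(token_count(ref) for ref, _ in pairs)
        / mean(token_count(summ) for _, summ in pairs)
    )
    return per_document, ratio_of_means


# Toy (reference, summary) pairs; purely illustrative, not real corpus data.
toy_pairs = [
    ("a " * 100, "a " * 10),  # 100 / 10 = 10
    ("b " * 300, "b " * 10),  # 300 / 10 = 30
    ("c " * 200, "c " * 40),  # 200 / 40 = 5
]

per_document, ratio_of_means = corpus_compression_ratios(toy_pairs)
print(per_document, ratio_of_means)  # prints "15.0 10.0"
```

On the toy data the two aggregates differ (15.0 versus 10.0), so it is worth checking which definition a given paper uses before comparing compression ratios across datasets.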

Cite our work

If you use the dataset or other parts of this code base, please use the following citation for attribution:

@inproceedings{aumiller-etal-2022-eur,
    title = {{EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain}},
    author = "Aumiller, Dennis  and
      Chouhan, Ashish  and
      Gertz, Michael",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.519",
    pages = "7626--7639"
}

License Information

The editorial content of the EUR-Lex website, the summaries of EU legislation, and the consolidated texts, for which the EU owns the copyright, are licensed under the Creative Commons Attribution 4.0 International licence (CC BY 4.0), as stated on the official EUR-Lex website. Any data artifacts remain licensed under CC BY 4.0.

License for software component

Following the recommendation of Creative Commons, we apply a separate license to the software component of this repository. We use the standard MIT license for code artifacts.
