German Hate Speech Corpus

This repository contains German text instances from several sources (Facebook comments, tweets, etc.; see the References section) that were manually re-annotated as hate speech (hs), offensive/problematic language (p), or non-hate (n). All files are tab-separated CSV files. The corpus is currently under construction and is subject to change.
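As a rough illustration of working with tab-separated files that carry the three labels, the sketch below counts how often each label occurs. The column layout (text first, label second) and the placeholder rows are assumptions made for this example; check the actual files for their real column order and headers.

```python
import csv
import io
from collections import Counter

# The three annotation labels named in the description above.
LABELS = {"hs", "p", "n"}

def count_labels(handle, label_column=1):
    """Count labels in a tab-separated file.

    label_column is an assumed position, not taken from the repository.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        if len(row) <= label_column:
            continue  # skip blank or malformed lines
        label = row[label_column].strip()
        if label in LABELS:  # this also skips a header row, if present
            counts[label] += 1
    return counts

# Placeholder rows standing in for a real corpus file.
sample = io.StringIO(
    "example text one\tn\n"
    "example text two\ths\n"
    "example text three\tp\n"
    "example text four\tn\n"
)
label_counts = count_labels(sample)
print(label_counts)
```

To read a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8", newline="")`.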

Content

HASOC

The HASOC dataset includes 818 German tweets as well as German Facebook comments classified as either hate or non-hate. The comments were gathered with a keyword-based approach in [1]. Because the original corpus predominantly contains non-hate, we resampled it to obtain a (preliminary) equal ratio of hate and non-hate in this dataset.
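The README does not say how the resampling was done. One common way to reach an equal class ratio is to downsample every class to the size of the smallest one, sketched below. The function name, the fixed seed, and the placeholder instances are all hypothetical.

```python
import random

def balance_by_downsampling(instances, seed=0):
    """Downsample every class to the size of the smallest class.

    instances is a list of (text, label) pairs. This is an illustrative
    approach, not necessarily the procedure used for the corpus.
    """
    by_label = {}
    for text, label in instances:
        by_label.setdefault(label, []).append((text, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Placeholder data: eight non-hate instances and three hate instances.
instances = [("non-hate example %d" % i, "n") for i in range(8)]
instances += [("hate example %d" % i, "hs") for i in range(3)]
balanced = balance_by_downsampling(instances)
print(len(balanced))
```

Downsampling discards majority-class instances, which is the simplest option when the majority class is much larger than the minority class.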

Hatr

The Hatr dataset contains 432 text instances that were extracted from hatr.org, a website that collects German hate posts from various German blogs.

German refugees

This dataset contains 469 text instances from Ross et al. [2], a corpus of offensive tweets about the refugee crisis. The tweets were gathered with a keyword-based approach; in this case, all keywords were hashtags.

GermEval 2018

This dataset contains 2,871 text instances from the tweet corpus described in [3], created as part of the GermEval 2018 Task on the Identification of Offensive Language. For this project, we only included the tweets that were classified as 'OFFENSE'.
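A selection of this kind can be sketched as a filter over a tab-separated source file. The column positions and the complementary label 'OTHER' in the placeholder rows are assumptions for illustration; only the 'OFFENSE' label is taken from the description above.

```python
import csv
import io

def keep_offense(handle, label_column=1):
    """Yield the text of every row whose label is 'OFFENSE'.

    Assumes the text is in the first column; label_column is hypothetical.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) > label_column and row[label_column].strip() == "OFFENSE":
            yield row[0]

# Placeholder rows standing in for the source tweet corpus.
sample = io.StringIO(
    "placeholder tweet A\tOTHER\n"
    "placeholder tweet B\tOFFENSE\n"
    "placeholder tweet C\tOFFENSE\n"
)
offensive_texts = list(keep_offense(sample))
print(offensive_texts)
```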

POLLY

The POLLY corpus originally contains about 125,000 politically charged German tweets posted around the time of the 2017 German federal election [4]. Here, we re-annotated only a sample of tweets (around 4,500) that had previously been annotated as 'with hate', to maintain a balanced dataset overall.

Bretschneider/Peters

This dataset contains posts from popular and openly available Facebook pages that are known to attract xenophobia [5]. Currently, the dataset published here contains 3,500 comments on posts from the two pages "Pegida" and "I'm a patriot, not a nazi".

Acknowledgements

The German hate speech project at the University of Würzburg, including the creation of the hate speech dataset presented here, was made possible with funding from the Mapara Stiftung.

Many thanks also go to Lukas Weimer, who previously supervised the project.

References

[1] Mandl, Thomas, Sandip Modha, Prasenjit Majumder, Daksh Patel, Mohana Dave, Chintak Mandlia, and Aditya Patel. "Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages." FIRE '19: Proceedings of the 11th Forum for Information Retrieval Evaluation, 2019. 14–17.

[2] Ross, Björn, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. "Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis." Proceedings of NLP4CMC III: 3rd Workshop on Natural Language Processing for Computer-Mediated Communication, 2016. 6–9.

[3] Wiegand, Michael, Melanie Siegel, and Josef Ruppenhofer. "Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language." Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018), 2018. 1–10.

[4] De Smedt, Tom, and Sylvia Jaki. "The Polly corpus: Online political debate in Germany." Proceedings of the 6th Conference on Computer-Mediated Communication (CMC) and Social Media Corpora (CMC-corpora 2018), 2018.

[5] Bretschneider, Uwe, and Ralf Peters. "Detecting Offensive Statements towards Foreigners in Social Media." Proceedings of the 50th Hawaii International Conference on System Sciences, 2017.
