
hate-speech-and-offensive-language's Introduction

Automated Hate Speech Detection and the Problem of Offensive Language

Repository for Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. "Automated Hate Speech Detection and the Problem of Offensive Language." ICWSM. You can read the paper here.

NOTE: This repository is no longer actively maintained. Please do not post issues regarding the compatibility of the existing code with new versions of Python or the packages used. I will not accept any pull requests. If you plan to use this data or code in your research, please review the issues, as several Github users have suggested changes or improvements to the codebase.

2019 NEWS

We have a new paper on racial bias in this dataset and others; you can read it here.

WARNING: The data, lexicons, and notebooks all contain content that is racist, sexist, homophobic, and offensive in many other ways.

You can find our labeled data in the data directory. We have included them as a pickle file (Python 2.7) and as a CSV. You will also find a notebook in the src directory containing Python 2.7 code to replicate our analyses in the paper, and a lexicon in the lexicons directory that we generated to try to more accurately classify hate speech. The classifier directory contains a script, instructions, and the necessary files to run our classifier on new data; a test case is provided.
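For example, a minimal way to load the CSV version of the labeled data with pandas (the path and file name below are assumptions; adjust them to your checkout):

import pandas as pd

# Hypothetical path to the CSV in the data directory.
df = pd.read_csv("data/labeled_data.csv")
print(df.shape)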

Please cite our paper in any published work that uses any of these resources.

@inproceedings{hateoffensive,
  title = {Automated Hate Speech Detection and the Problem of Offensive Language},
  author = {Davidson, Thomas and Warmsley, Dana and Macy, Michael and Weber, Ingmar}, 
  booktitle = {Proceedings of the 11th International AAAI Conference on Web and Social Media},
  series = {ICWSM '17},
  year = {2017},
  location = {Montreal, Canada},
  pages = {512--515}
}

Contact

If you are interested in using our data, we would also appreciate it if you could fill out this short form, so we can keep track of how these data are used and get in contact with researchers working on similar problems.

If you have any questions please contact thomas dot davidson at rutgers dot edu.

hate-speech-and-offensive-language's People

Contributors: ingmarweber, t-davidson

hate-speech-and-offensive-language's Issues

Invalid keyword 'encoding' in open

Hi, I'm trying to train a classifier with your method. I've cloned your repository, created a virtualenv, and installed all the packages required to run your classifier.
The result of my pip freeze is:

certifi==2018.4.16
chardet==3.0.4
idna==2.6
nltk==3.3
numpy==1.14.3
pandas==0.23.0
Pyphen==0.9.4
python-dateutil==2.7.3
pytz==2018.4
repoze.lru==0.7
requests==2.18.4
scikit-learn==0.19.1
scipy==1.1.0
six==1.11.0
sklearn==0.0
textstat==0.4.1
urllib3==1.22
vaderSentiment==3.2.1

I then ran the classifier with the command you provided in the README, python classifier.py, but the output is:

Traceback (most recent call last):
  File "classifier.py", line 36, in <module>
    sentiment_analyzer = VS()
  File "/.../vaderSentiment/vaderSentiment.py", line 212, in __init__
    with open(lexicon_full_filepath, encoding='utf-8') as f:
TypeError: 'encoding' is an invalid keyword argument for this function

Would it be also possible for you to share a list of requirements (https://pip.readthedocs.io/en/1.1/requirements.html)?

Thank you
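For context, Python 2's built-in open() does not accept an encoding keyword; that argument exists only on io.open() (and on Python 3's open()), which is why vaderSentiment 3.x fails here under Python 2. A minimal sketch of the difference, using a hypothetical file name:

import io

# Fails on Python 2 with: TypeError: 'encoding' is an invalid keyword argument
# with open('lexicon.txt', encoding='utf-8') as f:
#     lexicon = f.read()

# Works on both Python 2 and Python 3:
with io.open('lexicon.txt', encoding='utf-8') as f:
    lexicon = f.read()

Downgrading vaderSentiment to an older, Python 2-compatible release may also avoid the error.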

Cannot run the classifier on trump_tweets.csv

After installing all the prerequisite packages, the following output appears after running python2 classifier.py:
Loading data to classify...
29885 tweets to classify
Loading trained classifier...
/home/user/anaconda2/lib/python2.7/site-packages/sklearn/base.py:311: UserWarning: Trying to unpickle estimator LinearSVC from version 0.18 when using version 0.19.1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
Loading other information...
/home/user/anaconda2/lib/python2.7/site-packages/sklearn/base.py:311: UserWarning: Trying to unpickle estimator TfidfTransformer from version 0.18 when using version 0.19.1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
/home/user/anaconda2/lib/python2.7/site-packages/sklearn/base.py:311: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.18 when using version 0.19.1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
Transforming inputs...
Built TF-IDF array
Built POS array
Traceback (most recent call last):
  File "classifier.py", line 227, in <module>
    X = transform_inputs(tweets, tf_vectorizer, idf_vector, pos_vectorizer)
  File "classifier.py", line 176, in transform_inputs
    oth_array = get_oth_features(tweets)
  File "classifier.py", line 149, in get_oth_features
    feats.append(other_features_(t))
  File "classifier.py", line 124, in other_features_
    syllables = textstat.syllable_count(words) #count syllables in words
  File "/home/user/anaconda2/lib/python2.7/site-packages/repoze/lru/__init__.py", line 348, in cached_wrapper
    val = func(*args, **kwargs)
  File "/home/user/anaconda2/lib/python2.7/site-packages/textstat/textstat.py", line 63, in syllable_count
    word_hyphenated = dic.inserted(word)
  File "/home/user/anaconda2/lib/python2.7/site-packages/pyphen/__init__.py", line 298, in inserted
    for position in reversed(self.positions(word)):
  File "/home/user/anaconda2/lib/python2.7/site-packages/pyphen/__init__.py", line 248, in positions
    return [i for i in self.hd.positions(word) if self.left <= i <= right]
  File "/home/user/anaconda2/lib/python2.7/site-packages/pyphen/__init__.py", line 197, in positions
    pointed_word = '.%s.' % word
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 18: ordinal not in range(128)
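This error usually means a Python 2 byte string containing non-ASCII bytes reached pyphen, which expects unicode. A minimal sketch of one common workaround, decoding the tweets before feature extraction (tweets is a hypothetical list of raw strings; this is not the repository's official fix):

def to_unicode(text):
    # In Python 2, str is a byte string; decode it explicitly so that
    # downstream libraries (textstat, pyphen) receive unicode.
    if isinstance(text, str):
        return text.decode('utf-8', errors='ignore')
    return text

tweets = [to_unicode(t) for t in tweets]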

Bug in tokenize and basic_tokenize

Hi, I noticed a bug in the tokenizing functions in the notebooks. In the tokenize function the regex used is

tweet = " ".join(re.split("[^a-zA-Z]", tweet.lower())).strip() --> I think the '' should be replaced with a '+' as this particular regex renders the tweet as a list of charachters

Same with the basic_teknize function.
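A quick illustration of the difference (on Python 3.7+, where zero-width matches are allowed to split the string):

import re

tweet = "Hello world!"

# With '*' the pattern can match the empty string, so the split happens
# between every character and the tweet comes back as single characters:
print(re.split("[^a-zA-Z]*", tweet.lower()))  # ['', 'h', 'e', 'l', 'l', 'o', ...]

# With '+' the split only happens on runs of non-letter characters:
print(re.split("[^a-zA-Z]+", tweet.lower()))  # ['hello', 'world', '']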

Possible to provide Tweet IDs in the data?

The original dataset on data.world contains the ID number for each tweet. Would it be possible to update this repo to include that information as well?

In my early-stage research, I would like to use the tweet ID for two things:

  • Checking whether the tweet has been deleted or removed (potentially for violating community policies)
  • Doing user research on the original tweeter (to try and extract features like gender and political orientation from their other tweets)

Thanks!

Imbalanced data

I have used your dataset, training an LSTM in Keras, but I am getting poor results for two of the classes. How did you overcome that?
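Not the authors' answer, but one common mitigation is to weight the loss inversely to class frequency so the rare hate-speech class is not drowned out. A minimal scikit-learn sketch (Keras exposes an analogous class_weight argument on model.fit):

from sklearn.linear_model import LogisticRegression

# 'balanced' reweights each class by the inverse of its frequency
# in the training data.
clf = LogisticRegression(class_weight='balanced')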

LSA

Since you are already doing TF-IDF, why not do LSA as well?
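For reference, LSA amounts to a truncated SVD applied to the TF-IDF matrix, so it would be a small addition. A minimal scikit-learn sketch (the documents and the component count are illustrative):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["an example tweet", "another example tweet", "a third tweet"]
tfidf = TfidfVectorizer().fit_transform(docs)

# Project the sparse TF-IDF matrix into a dense low-rank "topic" space.
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)
print(lsa.shape)  # (3, 2)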

Error while loading TF-IDFVectorizer

I'm trying to run the classifier on custom data, and I get this error:

Traceback (most recent call last):
  File "classifier.py", line 43, in <module>
    tf_vectorizer = joblib.load('final_tfidf.pkl')
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 578, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 508, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib/python2.7/pickle.py", line 1132, in find_class
    klass = getattr(mod, name)
AttributeError: 'module' object has no attribute 'tokenize'

I'm using Python 2 and scikit-learn 0.19.1.

Do I need a particular version of scikit-learn to un-pickle the files provided in the repo? If not, any suggestions are welcome and appreciated.
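One possible explanation, not confirmed by the maintainers: the pickled vectorizer stores a reference to a custom tokenize function defined in the module that created it, so the module doing the unpickling must expose a function with that exact name before joblib.load is called. A hedged sketch (the body of tokenize below is an assumption; it must match whatever was used when final_tfidf.pkl was created):

import re
from sklearn.externals import joblib

def tokenize(tweet):
    # Stand-in carrying the name the pickle expects; replace the body
    # with the tokenizer used during training.
    return re.split("[^a-zA-Z]+", tweet.lower())

tf_vectorizer = joblib.load('final_tfidf.pkl')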

Small bug

In classifier/classifier.py:

def preprocess(text_string):
    """
    Accepts a text string and replaces:
    1) urls with URLHERE
    2) lots of whitespace with one instance
    3) mentions with MENTIONHERE
    This allows us to get standardized counts of urls and mentions
    Without caring about specific people mentioned
    """
    space_pattern = '\s+'
    giant_url_regex = ('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
        '[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = '@[\w\-]+'
    parsed_text = re.sub(space_pattern, ' ', text_string)
    parsed_text = re.sub(giant_url_regex, '', parsed_text)
    parsed_text = re.sub(mention_regex, '', parsed_text)
    #parsed_text = parsed_text.code("utf-8", errors='ignore')
    return parsed_text

You forgot to add the "URLHERE" and "MENTIONHERE" tags; as written, the re.sub calls replace URLs and mentions with empty strings.
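For concreteness, a sketch of the fix the report is pointing at: substitute placeholder tokens instead of empty strings, so that URL and mention counts survive preprocessing:

parsed_text = re.sub(space_pattern, ' ', text_string)
parsed_text = re.sub(giant_url_regex, 'URLHERE', parsed_text)    # keep a URL marker
parsed_text = re.sub(mention_regex, 'MENTIONHERE', parsed_text)  # keep a mention marker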

Python 3.6.ipynb: not getting the same results as your notebook

Hi,
I am running your Python 3.6 notebook (src directory). However, I am getting different results from the ones shown in the notebook. My 'M' matrix, which is the input to the model, has dimensions (24783, 4023) as opposed to the (24783, 11172) shown in your notebook.

I downloaded your code and then installed all the required packages, and the code runs fine. Do I need to do something else to get the same results as shown in your notebook?

My results:

print(report)

              precision    recall  f1-score   support

           0       0.38      0.46      0.42       164
           1       0.93      0.87      0.90      1905
           2       0.67      0.79      0.73       410

   micro avg       0.83      0.83      0.83      2479
   macro avg       0.66      0.71      0.68      2479
weighted avg       0.85      0.83      0.84      2479
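Not an official answer, but a quick way to localize a dimension mismatch like this is to print the width of each feature block before they are stacked into M (the variable names below are assumptions; use whatever the notebook calls them):

# Hypothetical names for the three feature blocks in the notebook.
print(tfidf.shape[1])  # TF-IDF vocabulary size
print(pos.shape[1])    # POS-tag vocabulary size
print(feats.shape[1])  # hand-crafted "other" features

A smaller TF-IDF vocabulary usually points to a different tokenizer, stopword list, or min_df/max_df setting, often a side effect of different package versions.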

Errors when running the initial code

I am getting an error at the following import:

from textstat.textstat import *

and the message when I import this module is: "module repoze.lru is missing".
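In case it helps others hitting the same message: repoze.lru is a separate package that textstat 0.4.x imports (it appears in the traceback of the UnicodeDecodeError issue above), so installing it explicitly into the active environment usually resolves the error:

pip install repoze.lru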
