
hate-speech-and-offensive-language's Introduction

Automated Hate Speech Detection and the Problem of Offensive Language

Repository for Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. "Automated Hate Speech Detection and the Problem of Offensive Language." ICWSM. You can read the paper here.

NOTE: This repository is no longer actively maintained. Please do not post issues regarding the compatibility of the existing code with new versions of Python or the packages used. I will not accept any pull requests. If you plan to use this data or code in your research, please review the issues, as several Github users have suggested changes or improvements to the codebase.

2019 NEWS

We have a new paper on racial bias in this dataset and others; you can read it here.

WARNING: The data, lexicons, and notebooks all contain content that is racist, sexist, homophobic, and offensive in many other ways.

You can find our labeled data in the data directory. We have included them as a pickle file (Python 2.7) and as a CSV. You will also find a notebook in the src directory containing Python 2.7 code to replicate our analyses in the paper, and a lexicon in the lexicons directory that we generated to try to more accurately classify hate speech. The classifier directory contains a script, instructions, and the necessary files to run our classifier on new data; a test case is provided.
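For example, a minimal way to load the CSV version of the labeled data with pandas (the path and file name below are assumptions; adjust them to your checkout):

import pandas as pd

# Hypothetical path to the CSV in the data directory.
df = pd.read_csv("data/labeled_data.csv")
print(df.shape)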

Please cite our paper in any published work that uses any of these resources.

@inproceedings{hateoffensive,
  title = {Automated Hate Speech Detection and the Problem of Offensive Language},
  author = {Davidson, Thomas and Warmsley, Dana and Macy, Michael and Weber, Ingmar}, 
  booktitle = {Proceedings of the 11th International AAAI Conference on Web and Social Media},
  series = {ICWSM '17},
  year = {2017},
  location = {Montreal, Canada},
  pages = {512--515}
}

Contact

If you are interested in using our data, we would also appreciate it if you could fill out this short form, so we can keep track of how these data are used and get in contact with researchers working on similar problems.

If you have any questions please contact thomas dot davidson at rutgers dot edu.

hate-speech-and-offensive-language's People

Contributors: ingmarweber, t-davidson

hate-speech-and-offensive-language's Issues

Invalid keyword 'encoding' in open

Hi, I'm trying to train a classifier with your method. I've cloned your repository, created a virtualenv, and installed all the packages required to run your classifier.
The result of my pip freeze is:

certifi==2018.4.16
chardet==3.0.4
idna==2.6
nltk==3.3
numpy==1.14.3
pandas==0.23.0
Pyphen==0.9.4
python-dateutil==2.7.3
pytz==2018.4
repoze.lru==0.7
requests==2.18.4
scikit-learn==0.19.1
scipy==1.1.0
six==1.11.0
sklearn==0.0
textstat==0.4.1
urllib3==1.22
vaderSentiment==3.2.1

I then ran the classifier with the command you provided in the README, python classifier.py, but the output is:

Traceback (most recent call last):
  File "classifier.py", line 36, in <module>
    sentiment_analyzer = VS()
  File "/.../vaderSentiment/vaderSentiment.py", line 212, in __init__
    with open(lexicon_full_filepath, encoding='utf-8') as f:
TypeError: 'encoding' is an invalid keyword argument for this function

Would it be also possible for you to share a list of requirements (https://pip.readthedocs.io/en/1.1/requirements.html)?

Thank you
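For context, Python 2's built-in open() does not accept an encoding keyword; that argument exists only on io.open() (and on Python 3's open()), which is why vaderSentiment 3.x fails here under Python 2. A minimal sketch of the difference, using a hypothetical file name:

import io

# Fails on Python 2 with: TypeError: 'encoding' is an invalid keyword argument
# with open('lexicon.txt', encoding='utf-8') as f:
#     lexicon = f.read()

# Works on both Python 2 and Python 3:
with io.open('lexicon.txt', encoding='utf-8') as f:
    lexicon = f.read()

Downgrading vaderSentiment to an older, Python 2-compatible release may also avoid the error.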

Cannot run the classifier on trump_tweets.csv

After installing all the prerequisite packages, the following output appears after running python2 classifier.py:
Loading data to classify...
29885 tweets to classify
Loading trained classifier...
/home/user/anaconda2/lib/python2.7/site-packages/sklearn/base.py:311: UserWarning: Trying to unpickle estimator LinearSVC from version 0.18 when using version 0.19.1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
Loading other information...
/home/user/anaconda2/lib/python2.7/site-packages/sklearn/base.py:311: UserWarning: Trying to unpickle estimator TfidfTransformer from version 0.18 when using version 0.19.1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
/home/user/anaconda2/lib/python2.7/site-packages/sklearn/base.py:311: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.18 when using version 0.19.1. This might lead to breaking code or invalid results. Use at your own risk.
UserWarning)
Transforming inputs...
Built TF-IDF array
Built POS array
Traceback (most recent call last):
  File "classifier.py", line 227, in <module>
    X = transform_inputs(tweets, tf_vectorizer, idf_vector, pos_vectorizer)
  File "classifier.py", line 176, in transform_inputs
    oth_array = get_oth_features(tweets)
  File "classifier.py", line 149, in get_oth_features
    feats.append(other_features_(t))
  File "classifier.py", line 124, in other_features_
    syllables = textstat.syllable_count(words) #count syllables in words
  File "/home/user/anaconda2/lib/python2.7/site-packages/repoze/lru/__init__.py", line 348, in cached_wrapper
    val = func(*args, **kwargs)
  File "/home/user/anaconda2/lib/python2.7/site-packages/textstat/textstat.py", line 63, in syllable_count
    word_hyphenated = dic.inserted(word)
  File "/home/user/anaconda2/lib/python2.7/site-packages/pyphen/__init__.py", line 298, in inserted
    for position in reversed(self.positions(word)):
  File "/home/user/anaconda2/lib/python2.7/site-packages/pyphen/__init__.py", line 248, in positions
    return [i for i in self.hd.positions(word) if self.left <= i <= right]
  File "/home/user/anaconda2/lib/python2.7/site-packages/pyphen/__init__.py", line 197, in positions
    pointed_word = '.%s.' % word
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 18: ordinal not in range(128)
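This error usually means a Python 2 byte string containing non-ASCII bytes reached pyphen, which expects unicode. A minimal sketch of one common workaround, decoding the tweets before feature extraction (tweets is a hypothetical list of raw strings; this is not the repository's official fix):

def to_unicode(text):
    # In Python 2, str is a byte string; decode it explicitly so that
    # downstream libraries (textstat, pyphen) receive unicode.
    if isinstance(text, str):
        return text.decode('utf-8', errors='ignore')
    return text

tweets = [to_unicode(t) for t in tweets]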

Bug in tokenize and basic_tokenize

Hi, I noticed a bug in the tokenizing functions in the notebooks. In the tokenize function the regex used is

tweet = " ".join(re.split("[^a-zA-Z]", tweet.lower())).strip() --> I think the '' should be replaced with a '+' as this particular regex renders the tweet as a list of charachters

Same with the basic_teknize function.
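A quick illustration of the difference (on Python 3.7+, where zero-width matches are allowed to split the string):

import re

tweet = "Hello world!"

# With '*' the pattern can match the empty string, so the split happens
# between every character and the tweet comes back as single characters:
print(re.split("[^a-zA-Z]*", tweet.lower()))  # ['', 'h', 'e', 'l', 'l', 'o', ...]

# With '+' the split only happens on runs of non-letter characters:
print(re.split("[^a-zA-Z]+", tweet.lower()))  # ['hello', 'world', '']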

Possible to provide Tweet IDs in the data?

The original dataset on data.world contains the ID number for each tweet. Would it be possible to update this repo to include that information as well?

In my early-stage research, I would like to use the tweet ID for two things:

  • Checking whether the tweet has been deleted or removed (potentially for violating community policies)
  • Doing user research on the original tweeter (to try and extract features like gender and political orientation from their other tweets)

Thanks!

Imbalanced data

I have used your dataset, training an LSTM in Keras, but I am getting poor results for two of the classes. How did you overcome that?
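Not the authors' answer, but one common mitigation is to weight the loss inversely to class frequency so the rare hate-speech class is not drowned out. A minimal scikit-learn sketch (Keras exposes an analogous class_weight argument on model.fit):

from sklearn.linear_model import LogisticRegression

# 'balanced' reweights each class by the inverse of its frequency
# in the training data.
clf = LogisticRegression(class_weight='balanced')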

LSA

Since you are already doing TF-IDF, why not do LSA as well?
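For reference, LSA amounts to a truncated SVD applied to the TF-IDF matrix, so it would be a small addition. A minimal scikit-learn sketch (the documents and the component count are illustrative):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["an example tweet", "another example tweet", "a third tweet"]
tfidf = TfidfVectorizer().fit_transform(docs)

# Project the sparse TF-IDF matrix into a dense low-rank "topic" space.
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)
print(lsa.shape)  # (3, 2)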

Error while loading TF-IDFVectorizer

I'm trying to run the classifier on custom data, and I get this error:

Traceback (most recent call last):
  File "classifier.py", line 43, in <module>
    tf_vectorizer = joblib.load('final_tfidf.pkl')
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 578, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/numpy_pickle.py", line 508, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1096, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib/python2.7/pickle.py", line 1132, in find_class
    klass = getattr(mod, name)
AttributeError: 'module' object has no attribute 'tokenize'

I'm using Python 2 and scikit-learn 0.19.1.

Do I need a particular version of scikit-learn to un-pickle the files provided in the repo? If not, any suggestions are welcome and appreciated.
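One possible explanation, not confirmed by the maintainers: the pickled vectorizer stores a reference to a custom tokenize function defined in the module that created it, so the module doing the unpickling must expose a function with that exact name before joblib.load is called. A hedged sketch (the body of tokenize below is an assumption; it must match whatever was used when final_tfidf.pkl was created):

import re
from sklearn.externals import joblib

def tokenize(tweet):
    # Stand-in carrying the name the pickle expects; replace the body
    # with the tokenizer used during training.
    return re.split("[^a-zA-Z]+", tweet.lower())

tf_vectorizer = joblib.load('final_tfidf.pkl')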

Small bug

In classifier/classifier.py:

def preprocess(text_string):
    """
    Accepts a text string and replaces:
    1) urls with URLHERE
    2) lots of whitespace with one instance
    3) mentions with MENTIONHERE
    This allows us to get standardized counts of urls and mentions
    Without caring about specific people mentioned
    """
    space_pattern = '\s+'
    giant_url_regex = ('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
        '[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = '@[\w\-]+'
    parsed_text = re.sub(space_pattern, ' ', text_string)
    parsed_text = re.sub(giant_url_regex, '', parsed_text)
    parsed_text = re.sub(mention_regex, '', parsed_text)
    #parsed_text = parsed_text.code("utf-8", errors='ignore')
    return parsed_text

You forgot to add the "URLHERE" and "MENTIONHERE" tags; as written, the re.sub calls replace URLs and mentions with empty strings.
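For concreteness, a sketch of the fix the report is pointing at: substitute placeholder tokens instead of empty strings, so that URL and mention counts survive preprocessing:

parsed_text = re.sub(space_pattern, ' ', text_string)
parsed_text = re.sub(giant_url_regex, 'URLHERE', parsed_text)    # keep a URL marker
parsed_text = re.sub(mention_regex, 'MENTIONHERE', parsed_text)  # keep a mention marker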

Python 3.6.ipynb: not getting the same results as your notebook

Hi,
I am running your Python 3.6 notebook (src directory). However, I am getting different results from the ones shown in the notebook. My 'M' matrix, which is the input to the model, has dimensions (24783, 4023) as opposed to the (24783, 11172) shown in your notebook.

I downloaded your code and then installed all the required packages, and the code runs fine. Do I need to do something else to get the same results as shown in your notebook?

My results:

print(report)

              precision    recall  f1-score   support

           0       0.38      0.46      0.42       164
           1       0.93      0.87      0.90      1905
           2       0.67      0.79      0.73       410

   micro avg       0.83      0.83      0.83      2479
   macro avg       0.66      0.71      0.68      2479
weighted avg       0.85      0.83      0.84      2479
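Not an official answer, but a quick way to localize a dimension mismatch like this is to print the width of each feature block before they are stacked into M (the variable names below are assumptions; use whatever the notebook calls them):

# Hypothetical names for the three feature blocks in the notebook.
print(tfidf.shape[1])  # TF-IDF vocabulary size
print(pos.shape[1])    # POS-tag vocabulary size
print(feats.shape[1])  # hand-crafted "other" features

A smaller TF-IDF vocabulary usually points to a different tokenizer, stopword list, or min_df/max_df setting, often a side effect of different package versions.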

Errors when running the initial code

I am getting an error at the following import:

from textstat.textstat import *

and the message when I import this module is: "module repoze.lru is missing".
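In case it helps others hitting the same message: repoze.lru is a separate package that textstat 0.4.x imports (it appears in the traceback of the UnicodeDecodeError issue above), so installing it explicitly into the active environment usually resolves the error:

pip install repoze.lru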
