xiaoleihuang / multilingual_fairness_lrec
Data and code repository of "Multilingual Fairness Evaluation for Hate Speech Detection". LREC 2020.
License: Apache License 2.0
In rnn.py, cnn.py and lr.py, there are some hard-coded paths that do not exist in the repository, for example ./data/encode, ./data/split, ./data/indices and ./resources/weights. Some paths in preprocess.py do not exist either. I found that the data in ./data/indices/ has now been split into train/eval/test. Does that mean I don't need to run preprocess.py?
Hi,
I have encountered a serious issue in your data. When the data is read in pandas with a \t separator (just as you did), several rows are merged into one for some tweets because of formatting issues in the data.
For instance:
data = pd.read_csv('anonymize/English/corpus.tsv', sep='\t', na_values='x')
print(data.shape)
would print
(83077, 11)
which is the same number of docs value you reported in your manuscript in Table 1.
However, check this out:
print(data.iloc[623].text)
results in
user : yes ! rt user : are you upset about tonight's elimination result ? hashtag 7 for the main ? i call rigged 2015-3-2 male 33.0 Melbourne Victoria Australia white neither
9119338424788223288 4926862304955734550 if you * really * don't like something a gamedev is doing , don't pirate their game . that makes you an asshole . just don't play it . 2015-5-11 female 15.0 Portland Oregon United States white neither
9175196123394046748 -2252236138639149560 user user all he does is attack black men . he hates himself and he doesn't even know it hashtag 2015-5-23 male 44.0 New Orleans Louisiana United States black racism
-7011258209083017299 4926862304955734550 user i hate numpads . 2015-2-19 female 15.0 Portland Oregon United States white neither
3133444189354840744 4926862304955734550 not sure about a stream , but i'll have at the very least a vine of zoe making the announcement . and i'll be livetweeting . 2015-3-4 female 15.0 Portland Oregon United States white neither
5487817575003690724 -5098803017287206708 automotive service manager - coon rapids , mn , 55433 hashtag hashtag rapids pls rt : * * overview : * * tires plus total car … url 2015-5-23 x x x x x x neither
-7442543494653349143 4926862304955734550 user i don't mention the name of the place i go to publicly :) 2015-5-3 female 15.0 Portland Oregon United States white neither
-7954606856165127796 4926862304955734550 this clan chat continues to be hilarious . url 2015-5-8 female 15.0 Portland Oregon United States white neither
9211584801926278231 5811374477814742037 hashtag hashtag hashtag hashtag hashtag hashtag hashtag hashtag hashtag hashtag url 2015-3-11 x x x x x x neither
-860627564535227330 4926862304955734550 a lot of women in tech have had to commit themselves so utterly to their work in order to be taken seriously . he's denying their identity . 2015-2-20 female 15.0 Portland Oregon United States white neither
-8609012007119515273 4926862304955734550 it is a * really bad thing * that now i know blackmilk swims fit well and are super comfy . really , really bad . 2015-2-10 female 15.0 Portland Oregon United States white neither
2032150622664019262 4926862304955734550 user user haha , how true . 2015-2-13 female 15.0 Portland Oregon United States white neither
-9121800105459111559 4926862304955734550 rt user : ok hearing this mask fucker saying the exact same shit that's been screamed at me for 6 months isn't fun anymore ... 2015-2-12 female 15.0 Portland Oregon United States white neither
1966015708529508302 5651239258254581284 catching up on hashtag did nikki & katie get a script to say the things they are saying because i wouldn't be caught dead saying any of that ! 2015-3-2 x x Sydney New South Wales Australia x neither
-4550724697291005892 4926862304955734550 user i was somewhere . maybe ? pink pullover , pink backpack ? 2015-2-11 female 15.0 Portland Oregon United States white neither
147731823526758452 4926862304955734550 on twitter , you don't know who you are talking to . " - oh , this woman couldn't be a software dev . oh lordy .
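meaning that the tweet text is corrupted: the text field of this one row swallows many following rows, including their IDs, dates, demographics and labels.
There are several examples like this, which I cannot list one by one.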
I strongly suggest that you address this issue. Would be happy to help you.
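One possible cause, which I have not confirmed on the actual corpus.tsv, is pandas' default quote handling: a text field that starts with a " character is read as a quoted field, so tabs and newlines are swallowed until the next " appears. The sketch below reproduces that behaviour on a small synthetic sample (the column names and rows are made up for illustration) and shows that passing quoting=csv.QUOTE_NONE keeps one row per line.

```python
import csv
import io

import pandas as pd

# Synthetic stand-in for corpus.tsv; these rows are hypothetical, not real data.
sample = (
    "tid\ttext\tlabel\n"
    '1\t"oh , this tweet opens a quote and never closes it\tneither\n'
    "2\tsecond tweet\tneither\n"
    '3\tthird tweet with a stray " quote\tneither\n'
)

# Default behaviour: '"' is treated as a quote character, so the text field of
# row 1 runs across line breaks until the next '"' is found.
default_df = pd.read_csv(io.StringIO(sample), sep="\t")

# QUOTE_NONE treats '"' as an ordinary character, so every line is one row.
raw_df = pd.read_csv(io.StringIO(sample), sep="\t", quoting=csv.QUOTE_NONE)

print(len(default_df), len(raw_df))
```

If this is indeed the cause, the same quoting=csv.QUOTE_NONE argument could be added to the read_csv call above. Comparing the resulting row count with the number of lines in the file would be a quick sanity check.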
Thanks a lot for this dataset!
I have realized that your encoding treats the subcategories normal and neither as non-hate speech (at least for English). There are several label subcategories, so what exactly is neither? Also, what about the subcategories link and spam? They are currently considered hate speech according to your encoding. Your publication does not mention this either.
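To make the question concrete, here is a minimal sketch of an explicit label encoding. It is not the repository's code. The NON_HATE set and the encode_label helper are hypothetical, and placing link and spam in the non-hate set is only an assumption about the intended behaviour, which the authors would need to confirm.

```python
import pandas as pd

# Assumption: labels that should count as non-hate speech. Whether 'link' and
# 'spam' belong here is exactly the open question; they are listed only to
# illustrate the alternative encoding.
NON_HATE = {"normal", "neither", "link", "spam"}


def encode_label(label: str) -> int:
    """Return 0 for non-hate labels and 1 for every other label."""
    return 0 if label in NON_HATE else 1


# Example labels, including 'racism', which appears in the rows quoted above.
labels = pd.Series(["neither", "racism", "spam", "link", "normal"])
encoded = labels.map(encode_label)
print(encoded.tolist())
```

Keeping the non-hate labels in one named set makes the choice visible, so it can be stated alongside the reported results.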