
sweater's Introduction

sweater


The goal of sweater (Speedy Word Embedding Association Test & Extras using R) is to test for associations among words in word embedding spaces. The methods provided by this package can also be used to test for unwanted associations, or biases.

The package provides functions that are speedy: they are either implemented in C++ or are fast yet accurate approximations of the original implementation proposed by Caliskan et al (2017). See the benchmark here.

This package provides extra methods such as Relative Norm Distance, Embedding Coherence Test, SemAxis and Relative Negative Sentiment Bias.

If your goal is to reproduce the analysis in Caliskan et al (2017), please consider using the original Java program or the R package cbn by Lowe. To reproduce the analysis in Garg et al (2018), please consider using the original Python program. To reproduce the analysis in Manzini et al (2019), please consider using the original Python program.

Please cite this software as:

Chan, C., (2022). sweater: Speedy Word Embedding Association Test and Extras Using R. Journal of Open Source Software, 7(72), 4036, https://doi.org/10.21105/joss.04036

For a BibTeX entry, use the output from citation(package = "sweater").
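
For example, toBibtex() from the bundled utils package converts the citation object into a BibTeX entry:

toBibtex(citation(package = "sweater"))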

Installation

Recommended: install the latest development version

remotes::install_github("gesistsa/sweater")

or the “stable” release

install.packages("sweater")

Notation of a query

All tests in this package use the concept of queries (see Badilla et al., 2020) to study associations in the input word embeddings w. This package uses the “STAB” notation from Brunet et al (2019). [1]

All tests depend on two types of words. The first type, namely S_words and T_words, is target words (or neutral words in Garg et al). In the case of studying biases, these are words that should have no bias. For instance, words such as “nurse” and “professor” can be used as target words to study gender bias in word embeddings. One can also separate these words into two sets, S_words and T_words, to group words by their perceived bias. For example, Caliskan et al. (2017) grouped target words into two groups: mathematics (“math”, “algebra”, “geometry”, “calculus”, “equations”, “computation”, “numbers”, “addition”) and arts (“poetry”, “art”, “dance”, “literature”, “novel”, “symphony”, “drama”, “sculpture”). Please note that T_words is not always required.
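
For instance, the two target-word groups above can be written as plain R character vectors (the same sets appear in the WEAT example below):

S_words <- c("math", "algebra", "geometry", "calculus",
             "equations", "computation", "numbers", "addition")
T_words <- c("poetry", "art", "dance", "literature",
             "novel", "symphony", "drama", "sculpture")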

The second type, namely A_words and B_words, is attribute words (or group words in Garg et al). These are words with known properties in relation to the bias that one is studying. For example, Caliskan et al. (2017) used gender-related words such as “male”, “man”, “boy”, “brother”, “he”, “him”, “his”, “son” to study gender bias. These words qualify as attribute words because we know they are related to a certain gender.

It is recommended to use the function query() to make a query and calculate_es() to calculate the effect size.
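
For example, a minimal end-to-end sketch using the bundled glove_math data (a toy query; the word sets here are deliberately tiny):

require(sweater)
## make a query; the method ("mac" here) is guessed from the word sets supplied
q <- query(glove_math, S_words = c("math", "algebra"), A_words = c("male", "man"))
## calculate the effect size of the query
calculate_es(q)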

Available methods

| Target words | Attribute words | Method | method argument | Suggested by query? | legacy functions [2] |
| --- | --- | --- | --- | --- | --- |
| S_words | A_words | Mean Average Cosine Similarity (Manzini et al. 2019) | "mac" | yes | mac(), mac_es() |
| S_words | A_words, B_words | Relative Norm Distance (Garg et al. 2018) | "rnd" | yes | rnd(), rnd_es() |
| S_words | A_words, B_words | Relative Negative Sentiment Bias (Sweeney & Najafian 2019) | "rnsb" | no | rnsb(), rnsb_es() |
| S_words | A_words, B_words | Embedding Coherence Test (Dev & Phillips 2019) | "ect" | no | ect(), ect_es(), plot_ect() |
| S_words | A_words, B_words | SemAxis (An et al. 2018) | "semaxis" | no | semaxis() |
| S_words | A_words, B_words | Normalized Association Score (Caliskan et al. 2017) | "nas" | no | nas() |
| S_words, T_words | A_words, B_words | Word Embedding Association Test (Caliskan et al. 2017) | "weat" | yes | weat(), weat_es(), weat_resampling(), weat_exact() |
| S_words, T_words | A_words, B_words | Word Embeddings Fairness Evaluation (Badilla et al. 2020) | To be implemented | | |

Example: Mean Average Cosine Similarity

The simplest form of bias detection is Mean Average Cosine Similarity (Manzini et al. 2019). The same method was used in Kroon et al. (2020). googlenews is a subset of the pretrained word2vec word embeddings provided by Google.

By default, the query() function guesses the method you want to use based on the combination of target words and attribute words provided (see the “Suggested by query?” column in the table above). You can also make this explicit by specifying the method argument. Printing the returned object shows the effect size (if available) as well as the functions that can further process the object: calculate_es and plot. Please read the help file of calculate_es (?calculate_es) for the meaning of the effect size of a specific test.

require(sweater)
S1 <- c("janitor", "statistician", "midwife", "bailiff", "auctioneer",
"photographer", "geologist", "shoemaker", "athlete", "cashier",
"dancer", "housekeeper", "accountant", "physicist", "gardener",
"dentist", "weaver", "blacksmith", "psychologist", "supervisor",
"mathematician", "surveyor", "tailor", "designer", "economist",
"mechanic", "laborer", "postmaster", "broker", "chemist", "librarian",
"attendant", "clerical", "musician", "porter", "scientist", "carpenter",
"sailor", "instructor", "sheriff", "pilot", "inspector", "mason",
"baker", "administrator", "architect", "collector", "operator",
"surgeon", "driver", "painter", "conductor", "nurse", "cook",
"engineer", "retired", "sales", "lawyer", "clergy", "physician",
"farmer", "clerk", "manager", "guard", "artist", "smith", "official",
"police", "doctor", "professor", "student", "judge", "teacher",
"author", "secretary", "soldier")

A1 <- c("he", "son", "his", "him", "father", "man", "boy", "himself",
"male", "brother", "sons", "fathers", "men", "boys", "males",
"brothers", "uncle", "uncles", "nephew", "nephews")

## The same as:
## mac_neg <- query(googlenews, S_words = S1, A_words = A1, method = "mac")
mac_neg <- query(googlenews, S_words = S1, A_words = A1)
mac_neg
#> 
#> ── sweater object ──────────────────────────────────────────────────────────────
#> Test type:  mac 
#> Effect size:  0.1375856
#> 
#> ── Functions ───────────────────────────────────────────────────────────────────
#> • `calculate_es()`: Calculate effect size
#> • `plot()`: Plot the bias of each individual word

The returned object is an S3 object. Please refer to the help file of the method for the definition of all slots (in this case: ?mac). For example, the magnitude of bias for each word in S1 is available in the P slot.

sort(mac_neg$P)
#>         sales      designer     economist       manager      clerical 
#>  -0.002892495   0.039197285   0.046155954   0.047322071   0.048912403 
#>      operator administrator        author    auctioneer        tailor 
#>   0.050275206   0.050319552   0.051470909   0.065440629   0.074771460 
#>     secretary     librarian     scientist  statistician         pilot 
#>   0.077506781   0.079040760   0.082535536   0.088000351   0.088337791 
#>     geologist      official     architect        broker     professor 
#>   0.088567238   0.090706054   0.091598653   0.098761198   0.101847166 
#>      engineer     collector         smith       chemist      surveyor 
#>   0.103448025   0.104596505   0.104956871   0.110798023   0.112098241 
#>     inspector        weaver     physicist       midwife    supervisor 
#>   0.112383017   0.113221694   0.114302092   0.115791724   0.118784135 
#>     physician        artist     conductor        clergy         guard 
#>   0.118990813   0.119571390   0.120602413   0.123313906   0.128804364 
#>    accountant    instructor         judge    postmaster         nurse 
#>   0.131700192   0.133135210   0.135238197   0.138497652   0.143781092 
#>          cook     attendant       sheriff        dancer  photographer 
#>   0.145019382   0.149134946   0.149992633   0.150637430   0.151388282 
#>  psychologist       cashier       surgeon mathematician       retired 
#>   0.151908676   0.153591372   0.158348402   0.158969004   0.165010593 
#>         clerk       student        porter      gardener       dentist 
#>   0.165903226   0.167006052   0.172551327   0.173346664   0.174776368 
#>       teacher       athlete       bailiff       painter        driver 
#>   0.175027901   0.176353551   0.176440157   0.176625091   0.181269327 
#>         baker     shoemaker        lawyer    blacksmith        farmer 
#>   0.183320490   0.183548112   0.189963886   0.198764788   0.199243319 
#>         mason        police   housekeeper        sailor      musician 
#>   0.203577329   0.206264491   0.208280255   0.208689761   0.219184802 
#>       janitor      mechanic        doctor       soldier       laborer 
#>   0.220953800   0.224008333   0.226657160   0.238053858   0.251032714 
#>     carpenter 
#>   0.259775292

Example: Relative Norm Distance

This analysis reproduces the analysis in Garg et al (2018), namely Figure 1.

B1 <- c("she", "daughter", "hers", "her", "mother", "woman", "girl",
"herself", "female", "sister", "daughters", "mothers", "women",
"girls", "females", "sisters", "aunt", "aunts", "niece", "nieces"
)

garg_f1 <- query(googlenews, S_words = S1, A_words = A1, B_words = B1)
garg_f1
#> 
#> ── sweater object ──────────────────────────────────────────────────────────────
#> Test type:  rnd 
#> Effect size:  -6.341598
#> 
#> ── Functions ───────────────────────────────────────────────────────────────────
#> • `calculate_es()`: Calculate effect size
#> • `plot()`: Plot the bias of each individual word

The object can be plotted by the function plot to show the bias of each word in S. Words such as “nurse”, “midwife” and “librarian” are more associated with female, as indicated by the positive relative norm distance.

plot(garg_f1)

The effect size is simply the sum of all relative norm distance values (Equation 3 in Garg et al. 2018). It is displayed when the object is printed. You can also use the function calculate_es to obtain the numeric result.

A more positive effect size indicates that words in S_words are more associated with B_words. As the effect size here is negative, it indicates that the concept of occupation is more associated with A_words, i.e. male.

calculate_es(garg_f1)
#> [1] -6.341598
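
Because the effect size is the sum of the per-word values, it can also be recovered by hand from the P slot (a sketch, assuming P stores the per-word relative norm distances as described above):

sum(garg_f1$P)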

Example: SemAxis

This analysis attempts to reproduce the analysis in An et al. (2018).

You may obtain the word2vec word vectors trained on the Reddit corpus of Trump supporters from here. This package provides a tiny version of the data, small_reddit, for reproducing the analysis.

S2 <- c("mexicans", "asians", "whites", "blacks", "latinos")
A2 <- c("respect")
B2 <- c("disrespect")
res <- query(small_reddit, S_words = S2, A_words = A2, B_words = B2, method = "semaxis", l = 1)
plot(res)

Example: Embedding Coherence Test

Embedding Coherence Test (Dev & Phillips, 2019) is similar to SemAxis. The only significant difference is that no “SemAxis” (the difference between the average word vectors of A_words and B_words) is calculated. Instead, it calculates two separate axes for A_words and B_words. Then it calculates the proximity of each word in S_words to the two axes. It is like doing two separate mac runs, but ect averages the word vectors of A_words / B_words first.
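
A conceptual sketch of this idea (not the package's internal implementation; it assumes the embeddings object is a matrix with words as row names, all words are present, and cosine similarity is the proximity measure):

cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
axis_A <- colMeans(googlenews[A1, , drop = FALSE]) # average vector of A_words
axis_B <- colMeans(googlenews[B1, , drop = FALSE]) # average vector of B_words
## proximity of each word in S_words to the two axes
P_manual <- sapply(S1, function(w)
  c(A_words = cos_sim(googlenews[w, ], axis_A),
    B_words = cos_sim(googlenews[w, ], axis_B)))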

It is important to note that P is a 2-D matrix. Hence, the plot is 2-dimensional. Words above the equality line are more associated with B_words and vice versa.

res <- query(googlenews, S_words = S1, A_words = A1, B_words = B1, method = "ect")
res$P
#>           janitor statistician   midwife   bailiff auctioneer photographer
#> A_words 0.3352883   0.13495237 0.1791162 0.2698131 0.10123085    0.2305419
#> B_words 0.2598501   0.08300127 0.3851766 0.2331852 0.06957685    0.2077952
#>          geologist shoemaker   athlete   cashier    dancer housekeeper
#> A_words 0.13817054 0.2842002 0.2607956 0.2340296 0.2282981   0.3205498
#> B_words 0.05101061 0.1850456 0.2570477 0.3171645 0.3508183   0.4610773
#>         accountant  physicist  gardener   dentist    weaver blacksmith
#> A_words  0.2029543 0.17446868 0.2657907 0.2672548 0.1767915  0.3080301
#> B_words  0.1789482 0.08362829 0.2873140 0.2623802 0.2475565  0.1603038
#>         psychologist supervisor mathematician   surveyor     tailor   designer
#> A_words    0.2322444  0.1852041     0.2423898 0.17124643 0.11379186 0.06231389
#> B_words    0.2418605  0.1920407     0.1332954 0.09133125 0.07585015 0.14343468
#>          economist  mechanic   laborer postmaster    broker   chemist librarian
#> A_words 0.07450962 0.3435494 0.3904412  0.2128712 0.1525395 0.1696522 0.1237070
#> B_words 0.04008006 0.1882135 0.3011930  0.2223472 0.1112061 0.1440956 0.3147546
#>         attendant   clerical  musician    porter scientist carpenter    sailor
#> A_words 0.2278508 0.07601974 0.3349666 0.2642203 0.1263250 0.4006367 0.3169384
#> B_words 0.2495253 0.15137979 0.2735083 0.1957056 0.1023058 0.2425019 0.3083380
#>         instructor   sheriff     pilot inspector     mason     baker
#> A_words  0.2034101 0.2256034 0.1339011 0.1741268 0.3154815 0.2847909
#> B_words  0.1903228 0.2029597 0.1112940 0.1272682 0.1585883 0.2981460
#>         administrator    architect collector   operator   surgeon    driver
#> A_words    0.08028339 0.1397101748 0.1572854 0.07317863 0.2337787 0.2733306
#> B_words    0.10544115 0.0008324421 0.1341877 0.08706450 0.1926543 0.2363398
#>           painter conductor     nurse      cook   engineer   retired
#> A_words 0.2703030 0.1832604 0.2187359 0.2278016 0.16052771 0.2494770
#> B_words 0.2413599 0.1034218 0.4470728 0.2849471 0.03511008 0.1146753
#>                sales    lawyer    clergy physician    farmer     clerk
#> A_words -0.006505338 0.2937436 0.1920894 0.1777700 0.3090903 0.2519372
#> B_words  0.032652565 0.2345743 0.2081210 0.1555298 0.2220792 0.3146901
#>            manager     guard    artist      smith  official    police    doctor
#> A_words 0.07080773 0.1948853 0.1819504 0.15938222 0.1300515 0.3116599 0.3413265
#> B_words 0.03393879 0.1344678 0.2274278 0.09691327 0.0743546 0.2590763 0.3390124
#>         professor   student     judge   teacher    author secretary   soldier
#> A_words 0.1604224 0.2540493 0.2008630 0.2675705 0.0828586 0.1211243 0.3599860
#> B_words 0.1368013 0.3299938 0.2493299 0.3567416 0.1224295 0.1220939 0.3076572
plot(res)

The effect size can also be calculated. It is the Spearman correlation coefficient of the two rows in P. A higher value indicates more “coherence”, i.e. less bias.
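
In other words (a sketch, using the row names of the P matrix shown above):

cor(res$P["A_words", ], res$P["B_words", ], method = "spearman")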

res
#> 
#> ── sweater object ──────────────────────────────────────────────────────────────
#> Test type:  ect 
#> Effect size:  0.7001504
#> 
#> ── Functions ───────────────────────────────────────────────────────────────────
#> • `calculate_es()`: Calculate effect size
#> • `plot()`: Plot the bias of each individual word

Example: Relative Negative Sentiment Bias

This analysis attempts to reproduce the analysis in Sweeney & Najafian (2019).

Please note that the datasets glove_sweeney, bing_pos and bing_neg are not included in the package. If you are interested in reproducing the analysis, the 3 datasets are available from here.

load("tests/testdata/bing_neg.rda")
load("tests/testdata/bing_pos.rda")
load("tests/testdata/glove_sweeney.rda")

S3 <- c("swedish", "irish", "mexican", "chinese", "filipino",
        "german", "english", "french", "norwegian", "american",
        "indian", "dutch", "russian", "scottish", "italian")
sn <- query(glove_sweeney, S_words = S3, A_words = bing_pos, B_words = bing_neg, method = "rnsb")

The analysis shows that indian, mexican, and russian are more likely to be associated with negative sentiment.

plot(sn)

The effect size from the analysis is the Kullback-Leibler divergence of P from the uniform distribution. It is extremely close to the value reported in the original paper (0.6225).
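
For reference, this can be computed by hand (a sketch, assuming the P slot holds the predicted negative-sentiment probabilities, which sum to one):

P <- sn$P
## Kullback-Leibler divergence of P from the uniform distribution 1/n
sum(P * log(P / (1 / length(P))))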

sn
#> 
#> ── sweater object ──────────────────────────────────────────────────────────────
#> Test type:  rnsb 
#> Effect size:  0.6228853
#> 
#> ── Functions ───────────────────────────────────────────────────────────────────
#> • `calculate_es()`: Calculate effect size
#> • `plot()`: Plot the bias of each individual word

Support for Quanteda Dictionaries

rnsb supports quanteda dictionaries as S_words. This support will be expanded to other methods later.

This analysis uses the data from here.

For example, newsmap_europe is an abridged dictionary from the package newsmap (Watanabe, 2018). The dictionary contains keywords of European countries and has two levels: regional level (e.g. Eastern Europe) and country level (e.g. Germany).

load("tests/testdata/newsmap_europe.rda")
load("tests/testdata/dictionary_demo.rda")

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
newsmap_europe
#> Dictionary object with 4 primary key entries and 2 nested levels.
#> - [EAST]:
#>   - [BG]:
#>     - bulgaria, bulgarian*, sofia
#>   - [BY]:
#>     - belarus, belarusian*, minsk
#>   - [CZ]:
#>     - czech republic, czech*, prague
#>   - [HU]:
#>     - hungary, hungarian*, budapest
#>   - [MD]:
#>     - moldova, moldovan*, chisinau
#>   - [PL]:
#>     - poland, polish, pole*, warsaw
#>   [ reached max_nkey ... 4 more keys ]
#> - [NORTH]:
#>   - [AX]:
#>     - aland islands, aland island*, alandish, mariehamn
#>   - [DK]:
#>     - denmark, danish, dane*, copenhagen
#>   - [EE]:
#>     - estonia, estonian*, tallinn
#>   - [FI]:
#>     - finland, finnish, finn*, helsinki
#>   - [FO]:
#>     - faeroe islands, faeroe island*, faroese*, torshavn
#>   - [GB]:
#>     - uk, united kingdom, britain, british, briton*, brit*, london
#>   [ reached max_nkey ... 10 more keys ]
#> - [SOUTH]:
#>   - [AD]:
#>     - andorra, andorran*
#>   - [AL]:
#>     - albania, albanian*, tirana
#>   - [BA]:
#>     - bosnia, bosnian*, bosnia and herzegovina, herzegovina, sarajevo
#>   - [ES]:
#>     - spain, spanish, spaniard*, madrid, barcelona
#>   - [GI]:
#>     - gibraltar, gibraltarian*, llanitos
#>   - [GR]:
#>     - greece, greek*, athens
#>   [ reached max_nkey ... 11 more keys ]
#> - [WEST]:
#>   - [AT]:
#>     - austria, austrian*, vienna
#>   - [BE]:
#>     - belgium, belgian*, brussels
#>   - [CH]:
#>     - switzerland, swiss*, zurich, bern
#>   - [DE]:
#>     - germany, german*, berlin, frankfurt
#>   - [FR]:
#>     - france, french*, paris
#>   - [LI]:
#>     - liechtenstein, liechtenstein*, vaduz
#>   [ reached max_nkey ... 3 more keys ]

Country-level analysis

country_level <- rnsb(w = dictionary_demo, S_words = newsmap_europe, A_words = bing_pos, B_words = bing_neg, levels = 2)
plot(country_level)

Region-level analysis

region_level <- rnsb(w = dictionary_demo, S_words = newsmap_europe, A_words = bing_pos, B_words = bing_neg, levels = 1)
plot(region_level)

A comparison of the two effect sizes. Please note the much smaller effect size from the region-level analysis. It reflects the more even distribution of P across regions than across countries.

calculate_es(country_level)
#> [1] 0.0796689
calculate_es(region_level)
#> [1] 0.00329434

Example: Normalized Association Score

Normalized Association Score (Caliskan et al., 2017) is similar to Relative Norm Distance above. It was used in Müller et al. (2023).

S3 <- c("janitor", "statistician", "midwife", "bailiff", "auctioneer",
"photographer", "geologist", "shoemaker", "athlete", "cashier",
"dancer", "housekeeper", "accountant", "physicist", "gardener",
"dentist", "weaver", "blacksmith", "psychologist", "supervisor",
"mathematician", "surveyor", "tailor", "designer", "economist",
"mechanic", "laborer", "postmaster", "broker", "chemist", "librarian",
"attendant", "clerical", "musician", "porter", "scientist", "carpenter",
"sailor", "instructor", "sheriff", "pilot", "inspector", "mason",
"baker", "administrator", "architect", "collector", "operator",
"surgeon", "driver", "painter", "conductor", "nurse", "cook",
"engineer", "retired", "sales", "lawyer", "clergy", "physician",
"farmer", "clerk", "manager", "guard", "artist", "smith", "official",
"police", "doctor", "professor", "student", "judge", "teacher",
"author", "secretary", "soldier")
A3 <- c("he", "son", "his", "him", "father", "man", "boy", "himself",
"male", "brother", "sons", "fathers", "men", "boys", "males",
"brothers", "uncle", "uncles", "nephew", "nephews")
B3 <- c("she", "daughter", "hers", "her", "mother", "woman", "girl",
"herself", "female", "sister", "daughters", "mothers", "women",
"girls", "females", "sisters", "aunt", "aunts", "niece", "nieces"
)

nas_f1 <- query(googlenews, S_words= S3, A_words = A3, B_words = B3, method = "nas")
plot(nas_f1)

There is a very strong correlation between NAS and RND.

cor.test(nas_f1$P, garg_f1$P)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  nas_f1$P and garg_f1$P
#> t = -24.93, df = 74, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.9650781 -0.9148179
#> sample estimates:
#>        cor 
#> -0.9453038

Example: Word Embedding Association Test

This example reproduces the detection of “Math vs. Arts” gender bias in Caliskan et al (2017).

data(glove_math) # a subset of the original GloVe word vectors

S4 <- c("math", "algebra", "geometry", "calculus", "equations", "computation", "numbers", "addition")
T4 <- c("poetry", "art", "dance", "literature", "novel", "symphony", "drama", "sculpture")
A4 <- c("male", "man", "boy", "brother", "he", "him", "his", "son")
B4 <- c("female", "woman", "girl", "sister", "she", "her", "hers", "daughter")
sw <- query(glove_math, S4, T4, A4, B4)

# extraction of effect size
sw
#> 
#> ── sweater object ──────────────────────────────────────────────────────────────
#> Test type:  weat 
#> Effect size:  1.055015
#> 
#> ── Functions ───────────────────────────────────────────────────────────────────
#> • `calculate_es()`: Calculate effect size
#> • `weat_resampling()`: Conduct statistical test

A note about the effect size

By default, the effect size from the function weat_es is adjusted by the pooled standard deviation (see page 2 of Caliskan et al. 2017). The standardized effect size can be interpreted the same way as Cohen's d (Cohen, 1988).

One can also get the unstandardized version (a.k.a. the test statistic in the original paper):

## weat_es
calculate_es(sw, standardize = FALSE)
#> [1] 0.02486533

The original implementation assumes equal sizes of S and T. This assumption can be relaxed by pooling the standard deviation with a sample-size adjustment. The function weat_es does this when S and T are of different lengths.
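
A sketch of this sample-size-adjusted pooling (the textbook pooled-variance formula; the actual internals of weat_es may differ in detail):

## pooled SD of the two sets of per-word association differences
pooled_sd <- function(S_diff, T_diff) {
  n1 <- length(S_diff)
  n2 <- length(T_diff)
  sqrt(((n1 - 1) * var(S_diff) + (n2 - 1) * var(T_diff)) / (n1 + n2 - 2))
}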

Also, the effect size can be converted to a point-biserial correlation (mathematically equivalent to Pearson's product-moment correlation).

weat_es(sw, r = TRUE)
#> [1] 0.4912066

Exact test

The exact test described in Caliskan et al. (2017) is also available, but it takes a long time to calculate.

## Don't do it. It takes a long time and is almost always significant.
weat_exact(sw)

Instead, please use the resampling approximation of the exact test. The p-value is very close to the reported 0.018.

weat_resampling(sw)
#> 
#>  Resampling approximation of the exact test in Caliskan et al. (2017)
#> 
#> data:  sw
#> bias = 0.024865, p-value = 0.0171
#> alternative hypothesis: true bias is greater than 7.245425e-05
#> sample estimates:
#>       bias 
#> 0.02486533

How to get help

  • Read the documentation
  • Search for issues

Contributing

Contributions in the form of feedback, comments, code, and bug reports are welcome.

Code of Conduct

Please note that the sweater project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

References

  1. An, J., Kwak, H., & Ahn, Y. Y. (2018). SemAxis: A lightweight framework to characterize domain-specific word semantics beyond sentiment. arXiv preprint arXiv:1806.05521.
  2. Badilla, P., Bravo-Marquez, F., & Pérez, J. (2020). WEFE: The word embeddings fairness evaluation framework. In Proceedings of the 29th International Joint Conference on Artificial Intelligence.
  3. Brunet, M. E., Alkalay-Houlihan, C., Anderson, A., & Zemel, R. (2019, May). Understanding the origins of bias in word embeddings. In International Conference on Machine Learning (pp. 803-811). PMLR.
  4. Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183-186.
  5. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd Edition. Hillsdale: Lawrence Erlbaum.
  6. Dev, S., & Phillips, J. (2019, April). Attenuating bias in word vectors. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 879-887). PMLR.
  7. Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635-E3644.
  8. Manzini, T., Lim, Y. C., Tsvetkov, Y., & Black, A. W. (2019). Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings. arXiv preprint arXiv:1904.04047.
  9. McGrath, R. E., & Meyer, G. J. (2006). When effect sizes disagree: The case of r and d. Psychological Methods, 11(4), 386.
  10. Müller, P., Chan, C. H., Ludwig, K., Freudenthaler, R., & Wessler, H. (2023). Differential racism in the news: Using semi-supervised machine learning to distinguish explicit and implicit stigmatization of ethnic and religious groups in journalistic discourse. Political Communication, 1-19.
  11. Rosenthal, R. (1991). Meta-Analytic Procedures for Social Research. Newbury Park: Sage.
  12. Sweeney, C., & Najafian, M. (2019, July). A transparent framework for evaluating unintended demographic bias in word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 1662-1667).
  13. Watanabe, K. (2018). Newsmap: A semi-supervised approach to geographical news classification. Digital Journalism, 6(3), 294-309.

  1. In the pre-0.1.0 version of this package, the main parameters were named S, T, A, and B. This was later changed because the symbol T is bound to the logical value TRUE as a global variable, and it is considered bad style to use the symbol T. Accordingly, the parameters were renamed to S_words, T_words, A_words, and B_words respectively. But in general, please stop using the symbol T to represent TRUE!
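
     A quick illustration of the pitfall (a hypothetical snippet):

     T <- 0      # allowed: T is merely a variable bound to TRUE by default
     isTRUE(T)   # FALSE -- any code that relied on T meaning TRUE is now broken
     ## TRUE <- 0 would fail, because TRUE is a reserved word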

  2. Please use the query function. These functions are kept for backward compatibility.

sweater's People

Contributors

chainsawriot, cmaimone


sweater's Issues

Implementing `print.sweater` and `plot.sweater`

As sweater is an S3 object, it would be great to have a print method and a plot method. By doing so, one only needs to run query and it will instantly print the results (say, the effect size).

The plot method is basically the same as plot_bias.

documentation: first example

The first example uses googlenews without introducing it first -- not even that it has word embeddings in it.

Output from the first example isn't explained. I found it confusing.

#> Effect size:  0.1375856
#> 
#> ── Functions ─────────────────────────────────
#> • <calculate_es()>: Calculate effect size
#> • <plot()>: Plot the bias of each individual word

Why are the function names in <>? Why are these listed? Are these next steps that I should take? Do I supply the output of the query function to them?

What is mac_neg$P?

re: openjournals/joss-reviews#4036

SemAxis

SemAxis by An et al (2018) is a variant of RND.

The only tricky part is augmenting A and B.

https://github.com/ghdi6758/SemAxis/blob/master/code/semaxis.py

I think I will use the notation in the paper, i.e. l instead of k in the Python code.

Reference:

@article{an2018semaxis,
  title={SemAxis: A lightweight framework to characterize domain-specific word semantics beyond sentiment},
  author={An, Jisun and Kwak, Haewoon and Ahn, Yong-Yeol},
  journal={arXiv preprint arXiv:1806.05521},
  year={2018}
}

paper suggestion

Suggestion: intro to the paper would be stronger/clearer with an example of what implicit word biases are -- that neutral seeming words can be more associated with one gender than the other, or one ethnic group over another. Which can lead to...

Just a few sentences.

Not absolutely needed, but it jumps pretty quickly into social science theory without a generally accessible example of what it is.

re: openjournals/joss-reviews#4036

dependency on `text2vec`

text2vec might be archived on Dec 5. The only instance this package uses text2vec is for calculating cosine similarity for SemAxis softening. It is possible to trim this dependency.

paper citations

Looks like the paper cites the R word2vec package, but not the original dataset. Should original dataset be cited? Same for small_reddit, glove_math?

Perhaps list them in the paper as sample word embeddings that are included in the package?

re: openjournals/joss-reviews#4036

Allow S to be a quanteda dictionary - rnsb

It would be better to allow S to be a dictionary:

require(quanteda)
S <- dictionary(list(japanese = c("Japaner", "Japanerin"),
                     korean = c("Koreaner", "Koreanerin")))

And then calculate the bias per word (i.e. Japaner/Japanerin), but aggregate to calculate the multinomial distribution of P by category (i.e. japanese).

installation instructions

Suggesting people use remotes::install_github("chainsawriot/sweater") instead of devtools::install_github("chainsawriot/sweater") will probably result in fewer issues, as remotes is easier for users to install successfully.

Would also consider rewording: "Or the slightly outdated version from CRAN" -- if you're going to offer via CRAN, that should be kept reasonably up to date, and the github version should be considered the developmental one.

re: openjournals/joss-reviews#4036

documentation: what is the preferred workflow?

It is recommended to use the function query() to make a query and calculate_es() to calculate the effect size. You can also use the functions listed below.

I'm confused why there are both general functions and specific ones -- seems like you should either specify the method as an argument to query() or have to use a query function specific to each method -- not both options.

Are functions mac(), rnd(), nas() etc functions that create a query? If so, it would be great to get "query" in their names to make that clear.

re: openjournals/joss-reviews#4036

"guess" method

I found it confusing in the documentation how the method was being determined. The examples are listed for specific methods, but the query code doesn't specify the method. I see that the default is to guess, which I assume it does based on the combination of STAB inputs provided? But if the examples are supposed to be of a specific method, it would probably be clearer to show code that invokes that method specifically -- what if you change "guess" behavior in the future?

re: openjournals/joss-reviews#4036

Range of effect sizes?

It would be really helpful if the documentation pages for the different methods (or the effect size function pages?) included information on the scale/range/directional of the effect size output. For example, are the values between 0 and 1? What is an example of a lot of bias vs. no bias?

re: openjournals/joss-reviews#4036

plotting error

> S4 <- c("math", "algebra", "geometry", "calculus", "equations", "computation", "numbers", "addition")
> T4 <- c("poetry", "art", "dance", "literature", "novel", "symphony", "drama", "sculpture")
> A4 <- c("male", "man", "boy", "brother", "he", "him", "his", "son")
> B4 <- c("female", "woman", "girl", "sister", "she", "her", "hers", "daughter")
> sw <- query(glove_math, S4, T4, A4, B4)
> 
> names(sw)
[1] "S_diff"  "T_diff"  "S_words" "T_words" "A_words" "B_words"
> class(sw)
[1] "sweater" "weat"    "list"   
> plot(sw)
Error in plot_bias(x) : No P slot in the input object x.FALSE

^ I ran the example, then tried to plot. Should this be giving this error?

re: openjournals/joss-reviews#4036

code structure question

Are effect sizes expensive to compute? Looks like a call to query() will compute the effect size automatically and print it out, so my guess is no? Is there a reason not to store the computed effect size in the sweater objects (so it can be referenced later)? With the effect size functions, I'd have to store the effect size for a query in a separate variable from the query object, which makes it easy to get things messed up in the code -- would be nice to have it as part of the result object.

Could still have es functions for convenience of accessing that component of the object if you wanted.

re: openjournals/joss-reviews#4036

Issue of pooled SD in `weat_es()`

Hi, I'm the author of the PsychWordVec package (an integrated toolkit of word embedding research). Your sweater package inspired me a lot when I developed test_WEAT() and test_RND() for my package.

Recently I browsed the source code of your sweater package and found that you used a different method to compute pooled SD in weat_es() if the sample sizes are not balanced (n1 != n2). Particularly, I found this line of code (https://github.com/chainsawriot/sweater/blob/master/R/sweater.R#L87):

pooled_sd <- sqrt(((n1 -1) * S_var) + ((n2 - 1) * T_var)/(n1 + n2 + 2))

I'm sure that in statistics this method is usually used to compute the pooled SD (https://www.statisticshowto.com/pooled-standard-deviation/), no matter whether the sample sizes are balanced or not. However, there are two issues to be addressed:

  1. For balanced sample sizes (n1 == n2), the pooled SD calculated by this method would be inconsistent with that by Caliskan et al.'s (2017) sd(c(S_diff, T_diff)) approach. How could we reconcile them? In my test_WEAT(), to avoid such inconsistency, I just follow Caliskan et al.'s approach regardless of whether n1 is equal to n2.
  2. Indeed, the code of computation is incorrect, with a misplaced pair of parentheses ) + ( before n2 - 1 and a wrong sign in n1 + n2 + 2 (which should be n1 + n2 - 2 instead). This could produce substantially wrong results because it actually computes the square root of the sum of (n1 - 1) * S_var and (n2 - 1) * T_var / (n1 + n2 + 2). The correct one should be
    pooled_sd <- sqrt(((n1 - 1) * S_var + (n2 - 1) * T_var) / (n1 + n2 - 2))

Best,
Bruce

Organize functions by query types

| set(s) of target words | set(s) of attribute words | Algo | Ref |
| --- | --- | --- | --- |
| 1 | 1 | MAC | Manzini |
| 1 | 2 | RNSB / RND | |
| 2 | 2 | WEAT | |
| 2 | 2 | WEFE | |

Do we have it?

  • MAC
  • RNSB
  • RND
  • WEAT
  • WEFE

all zero vectors will generate `NaN`

In general, cosine is not a good distance measure for all-zero vectors. But we can't change that.

https://github.com/chainsawriot/sweater/blob/6aebf710d813033c6d07f0268f12bd3e6badaee5/src/weat.cpp#L14

This will generate a "divide by zero" problem because deno_* is zero, the sqrt of deno_* is also zero, and so the denominator is zero.

A simple solution is to imitate PyTorch and use an eps (PyTorch uses 1e-8). The denominator should always be positive (due to the squaring and then rooting).
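
A sketch of such a guard in R (mirroring, as an assumption, PyTorch's approach of clamping the denominator at eps):

cos_sim_safe <- function(x, y, eps = 1e-8) {
  denom <- sqrt(sum(x^2)) * sqrt(sum(y^2))
  ## the denominator is now always at least eps, so no division by zero
  sum(x * y) / max(denom, eps)
}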

paper: other R packages?

Are there any other R packages available for bias in word embeddings, or more generally for bias in text corpora? If not, you can state that. If there are, would be good to reference.

The statement that sweater brings together methods that were only reported in papers' supplemental materials is clear -- I get the utility of the package. I just don't know if there's anything else out there or not.

re: openjournals/joss-reviews#4036

speed claims?

I'm not completely clear on what the speed claims are here. What is sweater faster than exactly?

I think benchmark.md is showing that implementing one method with Rcpp is faster than implementing the same method in pure R or in R running C code?

That doesn't seem sufficient for a speed claim in general, especially with that in the package name.

It's very possible I'm missing something though.

I'll note that, other than what's implied by the package name, the paper doesn't make specific speed/performance claims. Dropping the last sentence of the second paragraph of the paper would, I think, remove all reference to speed. If you don't want to go into full comparisons and benchmarking, it may just be enough to talk about how long sweater takes to do common things, and refrain from saying whether that's fast or slow. Having that information -- expected run times -- is very useful just by itself.

re: openjournals/joss-reviews#4036

Paper: questions on Query section

For this:

sweater uses the concept of query [@badilla2020wefe] to study the biases in $w$. A query contains two or more sets of seed words with at least one set of target words and one set of attribute words. sweater uses the $\mathcal{S}\mathcal{T}\mathcal{A}\mathcal{B}$ notation from @brunet2019understanding to form a query.

  • Need: concept of a query (missing a)
  • Why is STAB in mathematical notation?
  • Are target words and attribute words types of seed words? I think so, but that could be clearer

I would also find a little more info on target and attribute sets helpful. When you're supposed to supply two different sets, what is each supposed to be? What should be in S and what in T? I appreciate the references, and realize this may be complicated. Some type of brief summary here would help though. For example, for A and B, it seems each should be a set of words relevant to a group? Or the endpoints of a scale?

When you say target words shouldn't have bias, does that mean they are the words you're testing for bias?

names for vectors in output object?

Can the S_diff and T_diff components of the output for the method have names? I think each value corresponds to an input term, yes? Would be more useful as named vectors.

> sw$S_diff
[1]  0.003158583  0.003242220  0.001271607  0.031652155  0.003074379  0.016247332  0.035000510 -0.010817083
> sw$T_diff
[1] -0.0265718087  0.0054876842 -0.0523231481 -0.0117847993 -0.0369267966  0.0224587349 -0.0167662057
[8]  0.0003334358

re: openjournals/joss-reviews#4036

implementing relational inner product association (RIPA)

This paper claims that one can hack WEAT by cherry-picking words in A and B. The RIPA can protect against such hacking.

The method RIPA does not appear to be difficult to implement. But the fact that the paper doesn't publish any data bothers me.

Words in A and B must be in pairs, e.g.

A <- c("man", "men", "king")
B <- c("woman", "women", "queen")
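
A sketch of the core computation (an assumption based on the usual RIPA definition, not a tested implementation): the bias of a word vector w relative to a pair (a, b) is its inner product with the unit vector along a - b.

ripa <- function(w, a, b) {
  rel <- a - b                      # relation vector of the pair
  sum(w * rel) / sqrt(sum(rel^2))   # inner product with the unit relation vector
}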

error printing sweater object

I ran the first example, and then tried to print the mac_neg object:

> mac_neg

── sweater object ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Test type:  mac 
Effect size:  0.1375856 

── Functions ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Error in base::nchar(wide_chars$test, type = "width") : 
  lazy-load database '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/cli/R/sysdata.rdb' is corrupt
In addition: Warning messages:
1: In base::nchar(wide_chars$test, type = "width") :
  restarting interrupted promise evaluation
2: In base::nchar(wide_chars$test, type = "width") :
  internal error -3 in R_decompress1


> packageVersion("cli")
[1] '3.1.1'

Not sure if this is just my system? I had the same issue when trying to print other sweater objects that were the result of running the examples.

implementing Embedding Quality Test

It is like a variant of SemAxis.

https://arxiv.org/pdf/1901.07656.pdf

Embedding Quality Test is also possible. And the concept is fun. But it is quite difficult to reproduce because the procedure involves NLTK's WordNet to generate plurals and synonyms. WordNet is available for R, but making this package dependent on rJava is no laughing matter.
