Giter VIP home page Giter VIP logo

usage_change's Introduction

Detecting Words with Usage Change across Corpora

Code used in our ACL 2020 paper titled 'Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora'.

Intro

The problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well, and---as we show in this work---result in unstable, and hence less reliable, results. We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word. The method is simple, interpretable and stable. We demonstrate its effectiveness in 9 different setups, considering different corpus splitting criteria (age, gender and profession of tweet authors, time of tweet) and different languages (English, French and Hebrew).

t-SNE visualization of top-50 neighbors from each corpus for word clutch, Gender split with cyan for neighbors only in female corpus and violet for neighbors only in male corpus. Word clutch in gender split

t-SNE visualization of top-50 neighbors from each corpus for word dam, Age split with cyan for neighbors only in older corpus and violet for neighbors only in young corpus. Word dam in age split

Dependencies

  • nltk
  • numpy
  • scipy
  • sklearn
  • matplotlib
  • gensim (for custom datasets)

Quick Start

Use standard splits (e.g. age, gender) studied in our paper

  • Download the data, embeddings, vocab files used in our paper from Drive.
  • Open source/run.sh, set the variables STD_DIR to the path of the extracted data/embeddings and RES_DIR to the path where our method outputs and visualization plots will be stored.
  • Run our detection method on Age split (young vs. old):
bash source/run.sh detect standard young old
(format: bash source/run.sh detect standard <split_a> <split_b>)

will save the most changed words, stability scores and so on at $RES_DIR/detect_young_old_*. Other splits you can specify include: male, female, performer, creator, sports, weekday, weekend, hebrew2014, hebrew2018, french2014 and french2018.

  • Visualize the nearest neighbors of given words (dam, assist) in two subspaces:
bash source/run.sh visualize standard young old dam,assist
(format: bash source/run.sh visualize standard <split_a> <split_b> <words_in_csv>)

will save the two plots for each word as pdf at $RES_DIR/vis_young_old*.

Use custom splits

  • Ensure you preprocess the subcorpus (e.g., tokenization, handling numbers, removing URLs) and keep each sentence in a single line.
  • Open source/run.sh, set the variables STD_DIR to the path of the extracted data and and RES_DIR to the path where our method outputs and visualization plots will be stored.
  • Train word2vec for both the subcorpus (say corp1_posts and corp2_posts):
bash source/run.sh train custom corp1_posts corp1 corp2_posts corp2
(format: bash source/run.sh train custom <custom_split_a_txt_file> <custom_split_a_short_name> <custom_split_b_txt_file> <custom_split_b_short_name>)

will save the embeddings, vocab files at $RES_DIR/corp1* and $RES_DIR/corp2*. Note that corp1 and corp2 are shorthand names for the two subcorpus, mainly to aid bookkeeping.

  • Run our detection method on the custom split:
bash source/run.sh detect custom corp1_posts corp2_posts corp1 corp2
(format: bash source/run.sh detect custom <custom_split_a_txt_file> <custom_split_b_txt_file> <custom_split_a_short_name> <custom_split_b_short_name>)

will save the most changed words, stability scores and so on at $RES_DIR/detect_corp1_corp2_*.

  • Visualize the nearest neighbors of given words (word_a, word_b) in two subspaces:
bash source/run.sh visualize custom corp1_posts corp2_posts corp1 corp2 word_a,word_b
(format: bash source/run.sh visualize custom <custom_split_a_txt_file> <custom_split_b_txt_file> <custom_split_a_short_name> <custom_split_b_short_name> <words_in_csv>)

will save the two plots for each word as pdf at $RES_DIR/vis_corp1_corp2_*.

Cite

If you find this project useful, please cite the paper:

@inproceedings{gonen-etal-2020-simple,
    title = "Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora",
    author = "Gonen, Hila and Jawahar, Ganesh and Seddah, Djam{\'e}  and Goldberg, Yoav",
    booktitle = "Association for Computational Linguistics",
    year = "2020",
    pages = "538--555"
}

Contact

If you have any questions or suggestions, please contact Hila Gonen and Ganesh Jawahar.

License

This repository is GPL-licensed.

usage_change's People

Contributors

ganeshjawahar avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.