Gender, an R package

Data sets, historical or otherwise, often contain a list of first names but seldom identify those names by gender. Most techniques for finding gender programmatically, such as the Natural Language Toolkit, rely on lists of male and female names. However, the gender* of names can vary over time. Any data set that covers the normal span of a human life will require a historical method to find gender from names.

This package encodes gender based on names and dates of birth, using either the Social Security Administration's data set of first names by year since 1880 (based on an implementation by Cameron Blevins) or the U.S. Census data from IPUMS for years before 1930 (contributed by Ben Schmidt). Because it uses these data sets instead of fixed lists of male and female names, this package can guess the gender of a name more accurately. It can also report the proportion of times that a name was male or female for any given range of years.

See also Cameron's implementation of the same concept in a Python script.

[Figure: Twelve names that changed over time]

Installation

To install this package, first install devtools.

Then run the following command:

devtools::install_github("ropensci/gender")

Using the package

The simplest way to use this package is to pass a single name to the gender() function. You can optionally pass a single year or a range of years with the years option. If you specify the years option, the function will calculate the proportion of male and female uses of the name for that time period.

gender("madison")
# returns
#      name proportion_female gender proportion_male
# 1 madison            0.9828 female          0.0172

gender("madison", years = c(1900, 1985))
# returns
#      name proportion_female gender proportion_male
# 1 madison            0.0972   male          0.9028

gender("madison", years = 1985)
# returns
#      name proportion_female gender proportion_male
# 1 madison            0.7863 female          0.2137
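The idea behind the proportions above can be sketched in base R. This is only an illustration of the arithmetic, not the package's actual implementation: the counts in toy_counts are invented (they are not SSA figures), name_proportions() is a hypothetical helper, and breaking an exact 50/50 tie as "male" is an arbitrary choice made for the sketch.

```r
# Invented counts for illustration only; these are NOT real SSA figures.
toy_counts <- data.frame(
  name   = c("madison", "madison", "madison"),
  year   = c(1950L, 1985L, 2000L),
  female = c(0L, 300L, 9000L),
  male   = c(400L, 80L, 100L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: proportion of female and male uses of a name
# within an inclusive range of years.
name_proportions <- function(counts, name, years) {
  years <- range(years)  # a single year becomes c(year, year)
  rows <- counts[counts$name == name &
                   counts$year >= years[1] &
                   counts$year <= years[2], ]
  female <- sum(rows$female)
  male   <- sum(rows$male)
  total  <- female + male
  if (total == 0) {
    # No uses of the name in the period: nothing to estimate.
    return(data.frame(name = name, proportion_female = NA_real_,
                      gender = NA_character_, proportion_male = NA_real_,
                      stringsAsFactors = FALSE))
  }
  prop_female <- female / total
  data.frame(
    name = name,
    proportion_female = prop_female,
    # Arbitrary tie-breaking: exactly 0.5 is labeled "male" in this sketch.
    gender = if (prop_female > 0.5) "female" else "male",
    proportion_male = 1 - prop_female,
    stringsAsFactors = FALSE
  )
}

early <- name_proportions(toy_counts, "madison", c(1900, 1985))
late  <- name_proportions(toy_counts, "madison", 2000)
```

With these made-up counts, the same name comes out as mostly male for the earlier range and mostly female for the later year, which is the kind of shift the years option is meant to capture.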

You probably have a data set with many names. For now, this package assumes that you have a data frame with a column called name, which is a character vector (not a factor) containing all-lowercase names. If your data set does not match this, see dplyr and stringr for help. You can pass that data frame to the gender() function, which will add columns for gender and the certainty of that guess to your data frame.

gender(sample_names_data)
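If your names are stored as a factor or in mixed case, one way to get them into the expected shape is shown below. This is a minimal sketch in base R rather than dplyr or stringr; raw_names and prepared are hypothetical objects made up for the example, and trimming whitespace is an extra precaution that the package does not explicitly ask for.

```r
# Hypothetical raw data: mixed-case names stored as a factor.
raw_names <- data.frame(
  name = factor(c("Madison", "LESLIE", " Jordan "))
)

prepared <- raw_names

# Convert the factor to character, trim stray whitespace, and lowercase.
prepared$name <- tolower(trimws(as.character(prepared$name)))

is.character(prepared$name)
prepared$name
```

The prepared data frame now has a character column called name containing only lowercase names.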

With a data frame, you can specify a single year or a range of years as in the examples above. You can also specify a column in your data set that contains the year of birth associated with each name. For now, this column must be an integer vector (not a numeric vector) named year.

gender(sample_names_data, years = TRUE)
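Years typed or read into R are often stored as numeric (double) values rather than integers. A minimal sketch of checking and converting the column in base R follows; sample_years is a hypothetical data frame made up for the example.

```r
# Hypothetical data: years entered as plain numbers are doubles by default.
sample_years <- data.frame(
  name = c("madison", "leslie"),
  year = c(1985, 1920),
  stringsAsFactors = FALSE
)

# Check the storage type before conversion.
was_integer <- is.integer(sample_years$year)

# Convert the column to an integer vector.
sample_years$year <- as.integer(sample_years$year)
```

Note that as.integer() truncates any fractional part, so it is worth checking that the column really holds whole years before converting.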

If you prefer to use the Kantrowitz corpus of male and female names, you can use the method option.

gender(sample_names_data, method = "kantrowitz")

If you prefer a more minimal output, use the option certainty = FALSE to remove the proportion_male and proportion_female output.

Data

This package includes cleaned-up versions of several data sets. To see the available data sets run the following command:

data(package = "gender")
data(ssa_national)        # returns a data set with 1.6 million rows

The raw data sets used in this package are available here:

License

MIT License, http://lmullen.mit-license.org/

Citation

Eventually Cameron and I will publish an article about this method. In the meantime, you can cite and link to either his Python implementation or my implementation in this R package.

By Lincoln Mullen and contributors.

Note

* Of course, the Social Security Administration data more closely records the biological category of sex than the social category of gender, since it mostly records names given at birth. But since researchers will usually be interested in gender, I've named this package gender, leaving it up to researchers to interpret exactly what the encoded values mean.

