cat2cat's Introduction

cat2cat

Handling an Inconsistently Coded Categorical Variable in a Longitudinal Dataset

Unifying an inconsistently coded categorical variable in a panel/longitudinal dataset.
The package offers the novel cat2cat procedure, which maps a categorical variable according to a mapping (transition) table between two different time points. The mapping (transition) table should have a candidate for each category from the period targeted for an update. The main rule is to replicate an observation if it could be assigned to several categories, and then to use simple frequencies or modern statistical methods to approximate the probability of it being assigned to each of them.
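
The replication rule can be illustrated with a minimal base R sketch. It does not use the package itself: the toy data, the column names (id, code_old, old, new, wei) and the frequency values are assumptions made up for illustration, and only the simple-frequencies variant is shown.

```r
# Toy data (hypothetical): three observations coded under the old classification.
old_period <- data.frame(
  id = 1:3,
  code_old = c("A", "B", "B"),
  stringsAsFactors = FALSE
)

# Toy mapping (transition) table: old code "B" was split into "B1" and "B2".
trans_toy <- data.frame(
  old = c("A", "B", "B"),
  new = c("A", "B1", "B2"),
  stringsAsFactors = FALSE
)

# Hypothetical frequencies of the new codes observed in the later period.
new_freq <- c(A = 50, B1 = 30, B2 = 10)

# Replicate each observation once per candidate category.
replicated <- merge(old_period, trans_toy, by.x = "code_old", by.y = "old")

# Weight each replication by the relative frequency of its candidate
# among all candidates of the same observation.
replicated$freq <- unname(new_freq[as.character(replicated$new)])
replicated$wei <- replicated$freq / ave(replicated$freq, replicated$id, FUN = sum)

replicated <- replicated[order(replicated$id, replicated$new), ]
print(replicated[, c("id", "code_old", "new", "wei")])
```

In this sketch the weights of each observation sum to one, so the weights of the whole replicated table sum to the number of original observations.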

This algorithm was invented and implemented in the paper by Nasinski, Majchrowska and Broniatowska (2020).

For more details, please read the paper by Nasinski and Gajowniczek (2023).

Please visit the cat2cat webpage for more information.

Python Version

Installation

# install.packages("remotes")
remotes::install_github("polkas/cat2cat")
# or
install.packages("cat2cat")

Example

The occup dataset is an example of an unbalanced panel dataset. The data are simulated, although they reflect real-world characteristics from a national statistical office survey. The original survey is anonymous and takes place every two years.

The trans dataset contains mappings (transitions) between old (2008) and new (2010) occupational codes. This table can be used to map encodings in both directions.
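
How one table can serve both directions can be sketched in base R. The toy table, its codes and its column names (old, new) are assumptions chosen for illustration; they are not taken from the trans dataset.

```r
# Toy transition table (hypothetical codes): one row per old-new pair.
trans_toy <- data.frame(
  old = c("1111", "1111", "1112"),
  new = c("111101", "111102", "111102"),
  stringsAsFactors = FALSE
)

# Candidates in the new encoding for each old code.
old_to_new <- split(trans_toy$new, trans_toy$old)

# Candidates in the old encoding for each new code.
new_to_old <- split(trans_toy$old, trans_toy$new)

print(old_to_new)
print(new_to_old)
```

Reading the same rows from either column is what lets a single table map encodings both ways.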

A panel dataset without unique identifiers and with only two periods, using the backward direction and simple frequencies:

library("cat2cat")
data("occup", package = "cat2cat")
data("trans", package = "cat2cat")

occup_old <- occup[occup$year == 2008, ]
occup_new <- occup[occup$year == 2010, ]

occup_simple <- cat2cat(
  data = list(
    old = occup_old, new = occup_new,
    cat_var_old = "code", cat_var_new = "code", time_var = "year"
  ),
  mappings = list(trans = trans, direction = "backward")
)

A panel dataset without unique identifiers and with four periods, using the backward direction and ML models:

library("cat2cat")
data("occup", package = "cat2cat")
data("trans", package = "cat2cat")

occup_2006 <- occup[occup$year == 2006, ]
occup_2008 <- occup[occup$year == 2008, ]
occup_2010 <- occup[occup$year == 2010, ]
occup_2012 <- occup[occup$year == 2012, ]

library("caret")

ml_setup <- list(
  data = occup_2010,
  cat_var = "code",
  method = c("knn"),
  features = c("age", "sex", "edu", "exp", "parttime", "salary"),
  args = list(k = 10, ntree = 50)
)

mappings <- list(trans = trans, direction = "backward")

# ml model performance check
print(cat2cat_ml_run(mappings, ml_setup))

# from 2010 to 2008
occup_back_2008_2010 <- cat2cat(
  data = list(
    old = occup_2008, new = occup_2010, 
    cat_var_old = "code", cat_var_new = "code", time_var = "year"
  ),
  mappings = mappings,
  ml = ml_setup
)

# from 2008 to 2006
occup_back_2006_2008 <- cat2cat(
  data = list(
    old = occup_2006, new = occup_back_2008_2010$old,
    cat_var_new = "g_new_c2c", cat_var_old = "code", time_var = "year"
  ),
  mappings = mappings,
  ml = ml_setup
)

o_2006_new <- occup_back_2006_2008$old
o_2008_new <- occup_back_2008_2010$old # or occup_back_2006_2008$new
o_2010_new <- occup_back_2008_2010$new
o_2012_new <- dummy_c2c(
  occup_2012, cat_var = "code", ml = c("knn")
)

final_data_back <- do.call(
  rbind, 
  list(o_2006_new, o_2008_new, o_2010_new, o_2012_new)
)

# optional processing: keep only one observation per subject and period
# it is still recommended to keep all replications and use the weights in the statistical models
library(magrittr)
ff <- final_data_back %>% 
  split(.$year) %>% 
  lapply(function(x) cross_c2c(x)) %>% 
  lapply(function(x) 
    prune_c2c(x, column = "wei_cross_c2c", method = "highest1")
  ) %>% 
  do.call(rbind, .)
all.equal(nrow(ff), sum(ff$wei_cross_c2c))
all.equal(nrow(ff), sum(final_data_back$wei_freq_c2c))
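
The difference between keeping all replications with their weights and pruning to one row per subject can be sketched in base R on a toy table. The data and the column names (id, code, wei) are assumptions for illustration only, and the pruning step below is a hand-written stand-in, not the package's prune_c2c.

```r
# Toy replicated data (hypothetical): subject 2 has two candidate codes.
toy <- data.frame(
  id = c(1, 2, 2, 3),
  code = c("A", "B1", "B2", "A"),
  wei = c(1, 0.7, 0.3, 1),
  stringsAsFactors = FALSE
)

# Option 1: keep all replications and use the weights.
# The weighted counts per code sum to the number of subjects.
weighted_counts <- tapply(toy$wei, toy$code, sum)

# Option 2: keep only the most probable code per subject.
ordered_toy <- toy[order(toy$id, -toy$wei), ]
pruned <- ordered_toy[!duplicated(ordered_toy$id), ]
pruned$wei <- 1
pruned_counts <- table(pruned$code)

print(weighted_counts)
print(pruned_counts)
```

Pruning drops the less probable candidate entirely, while the weighted version keeps that uncertainty. The same weights can be passed to the weights argument of lm.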

More complex examples are presented in the "Get Started" vignette.

cat2cat's Issues

Wrong df for regression

Nrow(occup_2$old$wei) does not skip zero-probability rows when weights are used inside lm, so the degrees of freedom are wrong. Counting only the weights greater than zero will solve it.
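
The counting fix proposed here can be sketched in base R. The weight vector and the number of parameters k are hypothetical values chosen for illustration.

```r
# Hypothetical weights of replicated rows; two rows have zero probability.
w <- c(1, 0.6, 0.4, 0, 1, 0)

# Counting all rows includes the zero-probability ones.
n_rows <- length(w)

# Counting only positive weights skips them.
n_used <- sum(w > 0)

# Residual degrees of freedom for a model with k parameters (k is assumed).
k <- 2
df_all_rows <- n_rows - k
df_positive <- n_used - k

print(c(n_rows = n_rows, n_used = n_used))
print(c(df_all_rows = df_all_rows, df_positive = df_positive))
```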

many ml models at once

Then there would no longer be a single wei_ml_c2c column but separate ones, such as wei_knn_c2c and wei_rf_c2c. Somebody might want to cross these models.
Rewrite the code.
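
Crossing two sets of model weights could look like the following base R sketch. The column names wei_knn_c2c and wei_rf_c2c come from the proposal above; the toy values, the equal shares and the wei_cross column are assumptions, and this is a plain weighted average rather than the package's cross_c2c.

```r
# Toy replicated data (hypothetical): subject 1 has two candidate codes.
toy <- data.frame(
  id = c(1, 1, 2),
  wei_knn_c2c = c(0.8, 0.2, 1),
  wei_rf_c2c = c(0.6, 0.4, 1)
)

# Share given to each model; the shares should sum to one.
shares <- c(wei_knn_c2c = 0.5, wei_rf_c2c = 0.5)

# Weighted average of the two weight columns, row by row.
toy$wei_cross <- as.vector(as.matrix(toy[, names(shares)]) %*% shares)

print(toy)
```

Because each input column sums to one within a subject and the shares sum to one, the crossed weights also sum to one within each subject.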

Cat2cat optional freq_var

This optional variable would be a vector containing the frequencies for each group. It is needed, for example, if somebody wants to build recursive weights for the frequencies.

Check the length and the unique values.

Ggplot2 function

A plot to visualize the structure of the groups across time, among other things.

Organize old materials

  • prepare a case study based on the journal paper materials
  • process code from the existing ones for an unbalanced panel
  • example and code - 2 different datasets with the same key
  • ...
