
recodeflow's People

Contributors

cbjerke, dougmanuel, lurn, reikookamoto, rhan43, rvyuha, wyusuf068, yulric


Forkers

rafdoodle

recodeflow's Issues

Eliminate redundancy surrounding `rec_with_table`

I have been browsing through the source code of flow packages and noticed that the rec_with_table function is defined in multiple places:

After chatting with @DougManuel and @yulric, it appears that this duplication stems from the order in which the flow packages were developed (cchsflow, then elderflow, then recodeflow). Ideally, the packages should be refactored so that database-specific flow packages are concerned mainly with the variables sheet, the details sheet, and derived variable functions, and take recodeflow as a dependency so that recodeflow::rec_with_table is used consistently.

Error in `file(con, "r")`: cannot open the connection

Test fails on macOS ppc:

R version 4.2.3 (2023-03-15) -- "Shortstop Beagle"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: powerpc-apple-darwin10.8.0 (32-bit)

> library(testthat)
> library(recodeflow)
> test_check("recodeflow")
[ FAIL 1 | WARN 1 | SKIP 0 | PASS 0 ]

══ Failed tests ════════════════════════════════════════════════════════════════
── Error ('test-integration.R:6:3'): The PMML file is correctly generated ──────
Error in `file(con, "r")`: cannot open the connection
Backtrace:
    ▆
 1. └─base::readLines(expected_pmml_file, file.info(expected_pmml_file)$size) at test-integration.R:6:2
 2.   └─base::file(con, "r")

[ FAIL 1 | WARN 1 | SKIP 0 | PASS 0 ]
Error: Test failures
Execution halted

Also, there should be a testthat.R file in ./tests; otherwise the tests are simply skipped.

Support for recoding across multiple variable sets

Problem Description

Currently, recodeflow does not provide native support for handling multiple variables of the same type. This is limiting for researchers dealing with multifaceted data, especially in health.

Requested feature

Enhance the variableStart argument to accept a vector of variables for derived functions.

Current Status

As of now, variableStart accepts only a single variable for derived functions (I believe).

Minimal example

# Sample data
sample_data <- data.frame(
  respondentID = c(1, 2, 3),
  med1 = c("DrugA", "DrugC", "DrugB"),
  med2 = c("DrugB", "DrugA", "DrugC")
)

# Hypothetical function call and expected output
# ... Your function call here ...
# Expected return: c(TRUE, TRUE, FALSE)

Proposed change

New derived variable: is_taking_DrugA

Derived function: is_taking_DrugA_func

variableStart: med

Allowed input for med: variable list of length n, where n > 0.
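
To make the proposal concrete, here is a minimal base R sketch of how a prefix such as med could be expanded into the matching columns and passed to the derived function. It is not recodeflow's implementation: expand_variable_start() is a hypothetical helper, the numeric-suffix matching rule is an assumption, and is_taking_DrugA_func() is only one way the proposed derived function might look.

```r
# Sample data from the minimal example above.
sample_data <- data.frame(
  respondentID = c(1, 2, 3),
  med1 = c("DrugA", "DrugC", "DrugB"),
  med2 = c("DrugB", "DrugA", "DrugC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of recodeflow): expand a variableStart
# prefix such as "med" into the columns med1, med2, ... found in the data.
# Assumes the set members are named <prefix><number>.
expand_variable_start <- function(data, prefix) {
  grep(paste0("^", prefix, "[0-9]+$"), names(data), value = TRUE)
}

# Hypothetical derived function: TRUE when any of the supplied medication
# columns equals "DrugA" for that respondent.
is_taking_DrugA_func <- function(data, med_vars) {
  unname(rowSums(data[med_vars] == "DrugA", na.rm = TRUE) > 0)
}

med_vars <- expand_variable_start(sample_data, "med")
is_taking_DrugA <- is_taking_DrugA_func(sample_data, med_vars)
```

With this approach the same call works unchanged whether the data holds 2 or 40 medication columns, because the column list is derived from the data rather than typed out.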

Importance

This kind of batch recoding is crucial for multiple research scenarios:

People may be on multiple drugs, making it important to identify if any drug is of a specific type or class.
Hospitalization data often contains multiple diagnosis or procedure fields.
Socio-linguistic research may require identifying multiple languages spoken by respondents.

Anticipated questions

Why don't we just have a simple variable list, such as the following:
variableStart: med1, med2

This approach could work but has two limitations:

Verbosity: In large datasets like the Canadian Health Measures Survey that we're currently working with, which has 40 medication variables, specifying each variable individually could become excessively verbose and unwieldy.

Error-Prone: When dealing with multiple sets of similar variables (e.g., med1, med2, ..., med40 and last_taken1, last_taken2, ..., last_taken40), the list-based approach can become both verbose and prone to errors, such as omission or incorrect sequencing.

By contrast, allowing variableStart to accept a vector of variable names would provide a more concise and less error-prone way to manage the recoding across multiple similar variables.

Refactoring R/strings.R

You probably have more insight into this but what's the best way to use variables from other files? Do you just source that file or call it from the package namespace, so recodeflow:::variable_details_columns?

Originally posted by @yulric in #24 (comment)

`roles` header for variable table

What about including roles as a base feature of recodeflow, with supporting functions to select variables based on roles?

The purpose and function of roles would be the same as (or similar to) roles in tidymodels. https://recipes.tidymodels.org/reference/roles.html?q=roles#effects-of-non-standard-roles

This may be an extension of the scope of recodeflow and, hence, merit a good discussion.

What about creating a project and/or milestone for v1.0.0 and considering this feature for that version?
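
As a rough illustration of what a supporting function might look like, the sketch below selects variables by role from a variables table. Everything here is an assumption for discussion: the role column name, the comma-separated format for multiple roles, the select_by_role() helper, and the sample rows are all hypothetical and not part of recodeflow.

```r
# Hypothetical variables table with an added "role" column.
# A variable may carry several roles, separated by commas (an assumption).
variables <- data.frame(
  variable = c("age", "sex", "bmi", "outcome"),
  role = c("predictor", "predictor, stratifier", "predictor", "outcome"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the names of variables that have a given role.
select_by_role <- function(variables, role) {
  roles <- strsplit(variables$role, ",\\s*")
  has_role <- vapply(roles, function(r) role %in% r, logical(1))
  variables$variable[has_role]
}

predictors <- select_by_role(variables, "predictor")
stratifiers <- select_by_role(variables, "stratifier")
```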

variables.csv: add column "typeStart"

The current variables.csv has one column describing variable type: "typeEnd" (aka "variableType"). When working with real-life data, variables may have other data types before transformation and harmonization. I'd like to suggest adding a new "typeStart" column, and likely some functions to transform data from their original form.

include regex

Should we support regular expressions (regex) in recode 'from'?

This could address issue #14.
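
A minimal base R sketch of what regex matching in 'from' could look like is shown below. It is not how recodeflow currently works: the pattern and rec_to column names, the recode_with_regex() helper, and the first-match-wins rule are assumptions made for illustration.

```r
# Hypothetical recode rules: each 'from' entry is a regular expression.
recode_rules <- data.frame(
  pattern = c("^english$", "^french$"),
  rec_to = c(1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply the rules in order; the first matching pattern
# wins, and values that match no pattern stay NA.
recode_with_regex <- function(x, rules) {
  out <- rep(NA_real_, length(x))
  for (i in seq_len(nrow(rules))) {
    hit <- is.na(out) & grepl(rules$pattern[i], x, ignore.case = TRUE)
    out[hit] <- rules$rec_to[i]
  }
  out
}

recoded <- recode_with_regex(
  c("English", "ENGLISH", "French", "Klingon"),
  recode_rules
)
```

One design question this raises is how overlapping patterns should be resolved; the sketch simply uses rule order.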

Replace `is_equal()` with `identical()`

Consider replacing custom function with identical() from base R.

> identical(1, 2)
[1] FALSE
> identical(1, 1)
[1] TRUE
> identical(1, NA)
[1] FALSE
> identical(NA, NA)
[1] TRUE

This matches the behaviour shown in the examples section of the is_equal() documentation.
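
Two properties of identical() are worth checking before swapping it in: it is strict about type, and it returns a single logical for the whole pair of objects rather than comparing element by element. The short sketch below demonstrates both in base R; whether is_equal() is ever called with mixed types or with vectors is not established here, so treat this as a checklist rather than a finding.

```r
# identical() compares whole objects, including their storage type.
same_type <- identical(1, 1)                    # double vs double
mixed_type <- identical(1L, 1)                  # integer vs double
whole_vector <- identical(c(1, NA), c(1, NA))   # one result for the whole vector

# all.equal() compares numeric values with a tolerance and ignores the
# integer/double distinction; wrap it in isTRUE() to get a plain logical.
tolerant <- isTRUE(all.equal(1L, 1))
```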

Remove PMML functions

Remove all the code that provides PMML functionality. It's not relevant to the purpose of this package, which is to provide a transparent and reproducible way to harmonize and transform data.

variables.csv: allow more variable types

Following issue #18 (variables.csv: add column "typeStart"), I'd like to suggest allowing more variable types than categorical and continuous, especially in "typeStart":

  1. For continuous: do we need to separate "integer" from "double/float" (with allowed decimal places)?
  2. For categorical: maybe we need to separate "categorical" from "ordinal" variables. This may also apply to "typeEnd", since the two should be treated differently in subsequent data transformation and analysis.
  3. Text input, like comments under "Other" or other annotations, needs some extra and often user-specific transformations.
  4. Date and time: they are often used to derive other continuous variables, like age or duration of disease. But it will require some special packages and functions to deal with this data type.
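
To illustrate how a richer set of types could drive conversion, here is a minimal base R sketch that dispatches on a typeStart value. The type labels (integer, double, categorical, ordinal, text, date) mirror the list above but are assumptions, as is the convert_by_type() helper; recodeflow does not provide this function.

```r
# Hypothetical helper: convert a raw column according to its typeStart.
convert_by_type <- function(x, type_start) {
  switch(type_start,
    integer = as.integer(x),
    double = as.numeric(x),
    categorical = factor(x),
    ordinal = factor(x, ordered = TRUE),  # levels default to sorted order
    text = as.character(x),
    date = as.Date(x),                    # expects ISO dates such as "2020-01-15"
    stop("Unknown typeStart: ", type_start)
  )
}

visits <- convert_by_type(c("3", "5"), "integer")
severity <- convert_by_type(c("low", "high", "low"), "ordinal")
diagnosed <- convert_by_type("2020-01-15", "date")
```

In practice an ordinal type would also need an explicit level order from the details sheet, since sorted order is rarely the intended ranking.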

Suggestions for improving the vignettes

  • Adopt conventions when writing variable, function, and package names in addition to file paths
  • Clarify terminology
    • What is the difference between a variable name and a label, and how does this relate to other languages like Stata?
    • tester1 and tester2 should be referred to as datasets rather than databases; the latter usually refers to an organized collection of data
  • Clarify the order in which a new user should read the vignettes
  • Proofread for grammatical and spelling errors
  • Reduce redundancy in writing (i.e., how_to_use_recodeflow_with_your_data.Rmd contains the same information as variable_details.Rmd and variables_sheet.Rmd)
  • Make sure each vignette shows a particular workflow from start to finish (e.g., how to organize a variables sheet by walking through a sample, how to fill in a details sheet, how to call the main function to address common tasks)
  • DT package (used across several files) is in maintenance-only mode so consider using another package for making tables

Questions that crossed my mind

  • Can fields be left blank in the sheets?
  • Are there data type checks that confirm that the data in the sheets are of the correct data type?
    • E.g., we wouldn't expect numeric data in the label column of the variables sheet
  • Under what circumstances does the details sheet get updated after calling rec_with_table()?

Applying identical recoding rules to multiple columns

Is there a way to apply the same recode rules to multiple columns? The columns are identical (same possible observations); the only difference is the column name.

For example, the following data has a language column, which captures the language an individual speaks. There are 132 possible languages. At the moment all languages are captured in one column, but if you let each observation be its own column you get a total of 8 language columns: lang_1, lang_2, ..., lang_8.

I would like to recode the languages from text to numbers. E.g., English = 1, ENGLISH = 1, French = 2, etc... All language columns (e.g., lang_1, lang_2, etc) will use the same recoding rules. So the possible results are:
lang_1 = 132 values (English = 1, ENGLISH = 1, French = 2,...)
lang_2 = 132 values (English = 1, ENGLISH = 1, French = 2,...)
....

Currently, using recodeflow, I would have to create a row in the variable details sheet for every combination of column and value: 8 columns × 132 values = 1,056 rows. The only difference between the rows would be the names of the starting and final columns, e.g., from lang_1 to lang_recoded_1 and from lang_2 to lang_recoded_2.

Attached is example data and a variable_details sheet that contains the recoding structure for individual languages.

language.csv
variable_details_language.csv
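
As a possible workaround outside recodeflow, the shared rules can be kept in one small table and applied to each column in a loop. The base R sketch below assumes hypothetical rec_from and rec_to column names, a three-row excerpt of the rules, and two illustrative language columns; it does not reproduce the attached files.

```r
# One shared rule table (an illustrative excerpt, not the full 132 languages).
language_rules <- data.frame(
  rec_from = c("English", "ENGLISH", "French"),
  rec_to = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# Illustrative data with two of the language columns.
language_data <- data.frame(
  lang_1 = c("English", "French"),
  lang_2 = c("ENGLISH", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply the same rules to every listed column and
# write the result to lang_recoded_<n>. Unmatched values become NA.
apply_shared_rules <- function(data, columns, rules) {
  for (column in columns) {
    new_column <- sub("^lang_", "lang_recoded_", column)
    data[[new_column]] <- rules$rec_to[match(data[[column]], rules$rec_from)]
  }
  data
}

recoded <- apply_shared_rules(language_data, c("lang_1", "lang_2"), language_rules)
```

The rule table stays at one row per value regardless of how many columns share it, which is the saving the question is after.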
