fcds's People

Contributors

gadenbuie

Forkers

carvajalrodrigo

fcds's Issues

Rename age_low and age_high arguments

What this

fcds %>%
  filter_age_groups(age_low = 20, age_high = 50)

is really saying is

fcds %>%
  filter_age_groups(greater_than = 20, less_than = 50)

or

fcds %>%
  filter_age_groups(gt = 20, lt = 50)

or

fcds %>%
  filter_age_groups(age_gt = 20, age_lt = 50)

Function for recoding age groups

Should we add a function to make recoding ages easier? (See #45)

The syntax would be

fcds %>%
  recode_age(age_group, breaks = c(20, 50, 60, 85))

and would be equivalent to

fcds %>%
  mutate(
    age_group = case_when(
      age_high < 20 ~ "< 20",
      age_high < 50 ~ "20 - 49",
      age_high < 60 ~ "50 - 59",
      age_high < 85 ~ "60 - 84",
      TRUE ~ "85 +"
    ),
    age_group = fct_reorder(age_group, age_low)
  )
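A rough base R sketch of how the labels could be derived from breaks. recode_age_sketch() is a hypothetical stand-in, not the proposed recode_age(); it takes the upper bound of each age group directly and assumes that breaks never split an existing group.

```r
# Hypothetical helper (not part of fcds): label each age group by the
# break interval that contains its upper bound.
recode_age_sketch <- function(age_high, breaks) {
  breaks <- sort(breaks)
  n <- length(breaks)
  labels <- c(
    paste("<", breaks[1]),
    # interior labels only exist when there is more than one break
    if (n > 1) paste(breaks[-n], "-", breaks[-1] - 1),
    paste(breaks[n], "+")
  )
  # findInterval() is 0 below the first break and n at or above the last
  idx <- findInterval(age_high, breaks) + 1
  factor(labels[idx], levels = labels)
}

recode_age_sketch(c(4, 24, 54, 64, 90), breaks = c(20, 50, 60, 85))
```

Because the factor levels are built in break order, no separate fct_reorder() step is needed in this version.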

Document, test, finalize names of existing functions

  • expand_age_groups()
    • docs
    • name
    • tests
  • filter_age()
    • docs
    • name
    • tests
  • complete_age_groups()
    • docs
    • name
    • tests
  • standardize_age_groups()
    • docs
    • name
    • tests
  • complete_year_groups()
    • docs
    • name
    • tests
  • add_mid_year()
    • docs
    • name (change to add_year_mid() because appends _mid?)
    • tests
  • get_data()
    • remove? or re-purpose as fcds_data()?
  • age_adjust()
    • docs
    • name
    • tests
  • merge_fl_pop()
    • docs
      • should this be exported?
      • if exported, document minimum requirements for outside data
    • name
    • tests
  • age_adjust_finalize()
    • docs (internal)
    • name
    • tests
  • summarize_fcds()
    • docs
    • name (add alias for summarise_fcds()?)
    • tests
  • filter_fcds()
    • docs
    • name
    • tests
  • fcds_const()
    • docs
    • name
    • tests

Utilities

  • group_drop()
    • docs
    • name
    • tests
  • with_ungroup()
    • docs
    • name
    • tests
  • with_retain_groups()
    • docs
    • name
    • tests

Make R CMD check happy

Mostly means fixing warnings where I've used variables in dplyr chains. Some documentation fixes required as well.
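For reference, the usual fix for those notes is the .data pronoun. A minimal sketch, assuming dplyr is available; keep_younger_than() is a hypothetical function used only for illustration, and inside a package .data would normally also be imported from rlang.

```r
library(dplyr)

# Hypothetical helper. Writing `.data$age_high` instead of a bare
# `age_high` marks the name as a column of `data`, so R CMD check does not
# report it as an undefined global variable.
keep_younger_than <- function(data, max_age) {
  dplyr::filter(data, .data$age_high < max_age)
}

ages <- data.frame(
  age_group = c("0 - 4", "5 - 9", "10 - 14"),
  age_high = c(4, 9, 14),
  stringsAsFactors = FALSE
)
keep_younger_than(ages, 10)
```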

Add standard population with alternative age groups

Currently fcds::seer_std_ages uses the same age groups as the FCDS data, namely

> fcds_const("age_group")
 [1] "0 - 4"   "5 - 9"   "10 - 14" "15 - 19" "20 - 24"
 [6] "25 - 29" "30 - 34" "35 - 39" "40 - 44" "45 - 49"
[11] "50 - 54" "55 - 59" "60 - 64" "65 - 69" "70 - 74"
[16] "75 - 79" "80 - 84" "85+"     "Unknown"

The standard population is used in age_adjust(), which could be made more flexible by allowing alternative age groupings in the standard population.

We could have a standard_population() function (or similar) that makes this process easier, with the default behavior of returning fcds::seer_std_ages as currently formatted.
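A sketch of what the collapsing step might look like in base R. collapse_std_pop() and its column names (age_low, age_high, std_pop) are assumptions, and the population numbers are made-up toy values, not SEER figures. Open-ended groups such as "85+" are not handled here.

```r
# Hypothetical helper: sum a standard population into coarser age groups.
collapse_std_pop <- function(std, breaks) {
  breaks <- sort(breaks)
  lo <- findInterval(std$age_low, breaks)
  hi <- findInterval(std$age_high, breaks)
  if (any(lo != hi)) {
    stop("`breaks` must not split an existing age group", call. = FALSE)
  }
  n <- length(breaks)
  labels <- c(
    paste("<", breaks[1]),
    if (n > 1) paste(breaks[-n], "-", breaks[-1] - 1),
    paste(breaks[n], "+")
  )
  data.frame(
    age_group = factor(labels[sort(unique(lo)) + 1], levels = labels),
    # tapply() returns sums ordered by the sorted interval index
    std_pop = as.vector(tapply(std$std_pop, lo, sum))
  )
}

# Toy values for illustration only
toy_std <- data.frame(
  age_low = c(0, 5, 10, 15, 20),
  age_high = c(4, 9, 14, 19, 24),
  std_pop = c(100, 200, 300, 400, 500)
)
collapse_std_pop(toy_std, breaks = c(10, 20))
```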

Histology resources

From Oct 2018 Newsletter

WHO plans to release an ICD-O-3.2 in the future. Until then, we have multiple locations and resources that must be used to determine the best/correct histology including:

Don't add columns in filter_age()

d_age_group doesn't have age_low or age_high, so these should not be in the output.

> d_age_group %>% fcds::filter_age(age_high = 15)
  id age_group age_low age_high
1  1     0 - 4       0        4
2  2     10-14      10       14
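One way to get there is to compute the bounds as local vectors and only use them to subset rows. A base R sketch under that assumption; filter_age_sketch() is hypothetical and the input data is a guess at what d_age_group might look like.

```r
# Hypothetical sketch: derive age bounds for filtering without adding
# them to the data.
filter_age_sketch <- function(data, age_low = 0, age_high = Inf) {
  groups <- as.character(data$age_group)
  bounds <- regmatches(groups, gregexpr("\\d+", groups))
  low <- vapply(bounds, function(x) as.numeric(x[1]), numeric(1))
  # open-ended groups such as "85+" have no upper bound
  high <- vapply(
    bounds,
    function(x) if (length(x) > 1) as.numeric(x[2]) else Inf,
    numeric(1)
  )
  keep <- !is.na(low) & low >= age_low & high <= age_high
  data[keep, , drop = FALSE]
}

d_age_group <- data.frame(
  id = 1:4,
  age_group = c("0 - 4", "10-14", "15 - 19", "85+"),
  stringsAsFactors = FALSE
)
filter_age_sketch(d_age_group, age_high = 15)
```

The result has the same columns as the input, here id and age_group.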

empty fcds_vars() should list options

fcds::fcds_vars(NULL)
#> Error: `.x` is empty, and no `.init` supplied

Created on 2019-04-29 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.2 (2018-12-20)
#>  os       macOS High Sierra 10.13.6   
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2019-04-29                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                         
#>  assertthat    0.2.1      2019-03-21 [1] standard (@0.2.1)              
#>  backports     1.1.4      2019-04-10 [1] standard (@1.1.4)              
#>  callr         3.2.0      2019-03-15 [1] standard (@3.2.0)              
#>  cli           1.1.0      2019-03-19 [1] standard (@1.1.0)              
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                 
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.0)                 
#>  devtools      2.0.1      2018-10-26 [1] standard (@2.0.1)              
#>  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.0)                 
#>  dplyr         0.8.0.1    2019-02-15 [1] CRAN (R 3.5.2)                 
#>  evaluate      0.13       2019-02-12 [1] CRAN (R 3.5.2)                 
#>  fcds          0.0.5.9006 2019-04-29 [1] local                          
#>  fs            1.2.7      2019-03-19 [1] standard (@1.2.7)              
#>  glue          1.3.1      2019-03-12 [1] standard (@1.3.1)              
#>  highr         0.7        2018-06-09 [1] CRAN (R 3.5.0)                 
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)                 
#>  knitr         1.22       2019-03-08 [1] CRAN (R 3.5.2)                 
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)                 
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)                 
#>  pillar        1.3.1      2018-12-15 [1] CRAN (R 3.5.0)                 
#>  pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.0)                 
#>  pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.0)                 
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.0)                 
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)                 
#>  processx      3.3.0      2019-03-10 [1] CRAN (R 3.5.2)                 
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.0)                 
#>  purrr         0.3.2      2019-03-15 [1] standard (@0.3.2)              
#>  R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                 
#>  Rcpp          1.0.1      2019-03-17 [1] standard (@1.0.1)              
#>  remotes       2.0.4.9000 2019-04-23 [1] Github (r-lib/remotes@1f657ec) 
#>  rlang         0.3.4      2019-04-07 [1] standard (@0.3.4)              
#>  rmarkdown     1.11       2018-12-08 [1] CRAN (R 3.5.0)                 
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                 
#>  sessioninfo   1.1.1      2018-11-05 [1] standard (@1.1.1)              
#>  stringi       1.4.3      2019-03-12 [1] standard (@1.4.3)              
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.5.2)                 
#>  testthat      2.1.0.9000 2019-04-25 [1] Github (r-lib/testthat@8b8a481)
#>  tibble        2.1.1      2019-03-16 [1] standard (@2.1.1)              
#>  tidyr         0.8.3      2019-03-01 [1] CRAN (R 3.5.2)                 
#>  tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.0)                 
#>  usethis       1.5.0      2019-04-07 [1] CRAN (R 3.5.2)                 
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                 
#>  xfun          0.5.2      2019-03-14 [1] Github (yihui/xfun@d882a87)    
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.0)                 
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library
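
A sketch of the kind of check that would give a more useful message. select_var_groups() and var_groups are hypothetical stand-ins for fcds_vars() and its internal lookup; the "demo" columns are taken from output elsewhere in these issues and the "id" group is invented for illustration.

```r
# Hypothetical lookup of variable groups
var_groups <- list(
  demo = c("age_group", "race", "sex", "origin", "marital_status",
           "birth_country", "birth_state", "primary_payer"),
  id = c("patient_id")  # invented group, illustration only
)

select_var_groups <- function(...) {
  groups <- c(...)
  available <- paste(names(var_groups), collapse = ", ")
  if (length(groups) == 0) {
    # list the options instead of failing inside a reduce step
    stop("No variable group requested. Available groups: ", available,
         call. = FALSE)
  }
  unknown <- setdiff(groups, names(var_groups))
  if (length(unknown) > 0) {
    stop("Unknown group(s): ", paste(unknown, collapse = ", "),
         ". Available groups: ", available, call. = FALSE)
  }
  unique(unlist(var_groups[groups], use.names = FALSE))
}

select_var_groups("demo")
```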

complete_age_groups() should navigate around age_groups being in groups

library(tidyverse)
library(fcds)

fcds <- fcds_load()
#> Loading /Users/4468739/.fcds/fcds_2019-04-22-0844.rds
#> FCDS data checks are not yet implemented.

fcds %>% 
  sample_n(20) %>% 
  group_by(age_group) %>% 
  complete_age_groups()
#> Error: Column `age_group` can't be modified because it's a grouping variable

Created on 2019-04-22 by the reprex package (v0.2.1)
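
A possible approach is to temporarily drop the column from the grouping, do the work, and restore the original groups afterwards. A minimal sketch assuming dplyr and rlang are installed; modify_grouping_var() is a hypothetical helper, and the package's own with_ungroup() or group_drop() may already cover this.

```r
library(dplyr)

# Hypothetical helper: run `fn` on `data` with `var` removed from the
# grouping, then restore the original grouping.
modify_grouping_var <- function(data, var, fn) {
  groups <- dplyr::group_vars(data)
  other_groups <- setdiff(groups, var)
  data <- dplyr::ungroup(data)
  data <- dplyr::group_by(data, !!!rlang::syms(other_groups))
  out <- fn(data)
  dplyr::group_by(out, !!!rlang::syms(groups))
}

counts <- data.frame(
  year = c("1981-1985", "1981-1985"),
  age_group = c("0 - 4", "5 - 9"),
  n = c(20L, 17L),
  stringsAsFactors = FALSE
)
grouped <- dplyr::group_by(counts, year, age_group)

# Modifying age_group directly on `grouped` would fail; this does not.
result <- modify_grouping_var(
  grouped, "age_group",
  function(d) dplyr::mutate(d, age_group = factor(age_group))
)
dplyr::group_vars(result)
```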

Use factors in built-in population data

Now that standardize_age_groups() exists, run package data through this to avoid spurious "character-factor" joining issues.

Also reduce the storage size of the data by using factors where reasonable, in particular where it is clear that a column can have a limited number of values, e.g. county_name.

Future-proofing

The fcds package is geared toward processing the STAT_dataset_2018.dat data set, which is already outdated. We're in a good position to future-proof the package, but it may need a bit more work.

Where we are now:

  1. All of the processing steps can be specified in the fcds_recoding.yaml so that future updates only require modifications to this file.
  2. fcds_recoding is exposed at the top level in fcds_import(), so updating to new specs won't require major API changes.

What we need:

  • fcds_const() should take values from fcds_recoding.yaml (if applicable), rather than being hard-coded
  • Documentation structure may need to be modified to make it easier to document new data specs in a consistent manner
  • Add a helper function to choose the correct data spec file for the desired data set, either by matching against the input file name or allowing the user to specify.
  • Certain tests will probably break
  • Should the example data be updated?

mid_year() should figure out mid year on its own

fcds/R/year.R

Lines 49 to 52 in c9f6638

mid_year <- function(years, sep = "-", offset = 2) {
  low_year_regex <- paste0("(\\d{2,4}).*", sep, ".*")
  paste(as.integer(sub(low_year_regex, "\\1", years)) + offset)
}

  • replace pattern with glue::glue("(\\d{{2,4}})\\s*{sep}\\s*(\\d{{2,4}})")

  • determine algorithm for finding mid year (min + floor((max - min) / 2)?)

  • implement stringr::str_match()-esque interface

    regmatches("1994 - 2000", regexec(glue::glue("(\\d{{2,4}})\\s*{sep}\\s*(\\d{{2,4}})", sep = "-"), "1994 - 2000"))
    
  • remove offset or allow over-riding with default offset = NULL
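
Putting the points above together, a base R sketch might look like the following. mid_year_sketch() is hypothetical; it uses paste0() in place of glue, takes min + floor((max - min) / 2) as the mid year, and treats offset = NULL as "compute it from the range".

```r
# Hypothetical rewrite of mid_year(): parse both ends of the year range
# and compute the mid year unless an explicit offset is given.
mid_year_sketch <- function(years, sep = "-", offset = NULL) {
  years <- as.character(years)
  pattern <- paste0("(\\d{2,4})\\s*", sep, "\\s*(\\d{2,4})")
  matches <- regmatches(years, regexec(pattern, years))
  vapply(matches, function(m) {
    # no match: regmatches() returns character(0)
    if (length(m) < 3) return(NA_character_)
    low <- as.integer(m[2])
    high <- as.integer(m[3])
    if (is.null(offset)) offset <- floor((high - low) / 2)
    as.character(low + offset)
  }, character(1))
}

mid_year_sketch(c("1981-1985", "1994 - 2000", "unknown"))
```

For "1981-1985" this gives "1983", matching the current default of offset = 2 for five-year groups.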

inf rate estimates

library(fcds)

fcds <- fcds_load()

fcds %>% 
   dplyr::filter(cancer_site_group == "Prostate Gland") %>%
   count_fcds(sex = "Male") %>%
   age_adjust()

_min/_max vs _low/_high

Why are these different? 🤦‍♂

  1. age_group -> age_low, age_high
  2. year_group -> year_min, year_max

age_adjust() with keep_age should try harder?

If age groups don't match, then what?

library(tidyverse)
library(fcds)

fcds <- fcds_load()

# work with random subsample
fcds <- fcds %>% group_by(!!!rlang::syms(fcds_vars("demo"))) %>% sample_n(1) %>% ungroup()

If we do the regrouping first, age_adjust() will ultimately fail.

fcds_regrouped <- 
  fcds %>% 
  separate_age_groups() %>% 
  mutate(
    age_group = case_when(
      age_high < 20 ~ "< 20",
      age_high < 50 ~ "20 - 49",
      age_high < 60 ~ "50 - 64",
      age_high < 85 ~ "65 - 84",
      TRUE ~ "85 +"
    ),
    age_group = fct_reorder(age_group, age_low)
  )

fcds_vars(.data = fcds_regrouped, "demo")
#> # A tibble: 14,815 x 8
#>    age_group race  sex   origin marital_status birth_country birth_state
#>    <fct>     <fct> <fct> <fct>  <fct>          <fct>         <fct>      
#>  1 < 20      White Male  Non-H… Married; Unma… US States an… Florida    
#>  2 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  3 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  4 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  5 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  6 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  7 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  8 < 20      White Male  Non-H… Single, Separ… US States an… Other US S…
#>  9 < 20      White Male  Non-H… Single, Separ… US States an… Other US S…
#> 10 < 20      White Male  Non-H… Single, Separ… US States an… Other US S…
#> # … with 14,805 more rows, and 1 more variable: primary_payer <fct>

fcds_regrouped %>% 
  count_fcds() %>% 
  age_adjust(keep_age = TRUE)
#> The age groups in `data` do not match any age groups in
#> `population_standard`.

The current way around this is to do the re-grouping after the age adjustment.

fcds %>% 
  count_fcds() %>% 
  age_adjust(keep_age = TRUE) %>% 
  separate_age_groups() %>%
  group_drop(age_group) %>% 
  mutate(
    age_group = case_when(
      age_high < 20 ~ "< 20",
      age_high < 50 ~ "20 - 49",
      age_high < 60 ~ "50 - 64",
      age_high < 85 ~ "65 - 84",
      TRUE ~ "85 +"
    ),
    age_group = fct_reorder(age_group, age_low)
  ) %>% 
  group_by(age_group, add = TRUE) 
#> # A tibble: 126 x 9
#> # Groups:   year, year_mid, age_group [35]
#>    year  year_mid age_group age_low age_high     n population std_pop
#>    <fct> <chr>    <fct>       <dbl>    <dbl> <int>      <dbl>   <dbl>
#>  1 1981… 1983     < 20            0        4    20     672372  1.90e7
#>  2 1981… 1983     < 20            5        9    17     605665  1.99e7
#>  3 1981… 1983     < 20           10       14    17     712639  2.01e7
#>  4 1981… 1983     < 20           15       19    28     789181  1.98e7
#>  5 1981… 1983     20 - 49        20       24    48     890738  1.83e7
#>  6 1981… 1983     20 - 49        25       29    41     892078  1.77e7
#>  7 1981… 1983     20 - 49        30       34    41     793533  1.95e7
#>  8 1981… 1983     20 - 49        35       39    36     686575  2.22e7
#>  9 1981… 1983     20 - 49        40       44    46     581196  2.25e7
#> 10 1981… 1983     20 - 49        45       49    50     514395  1.98e7
#> # … with 116 more rows, and 1 more variable: w <dbl>

But the re-grouped ages overlap the underlying standard ages, so age_adjust() could have called standardize_age_groups() on the population data relative to the input data to do this for us.

Correct order of `year` factor

fcds %>% pull(year)
## [1] 1981-1985 1981-1985 2001-2005 2001-2005 1996-2000 1981-1985 1991-1995 1991-1995 1991-1995 1991-1995 1991-1995 1986-1990 1996-2000 2001-2005
## [15] 2001-2005 2001-2005 2001-2005 1996-2000 2001-2005 1981-1985 1996-2000 1996-2000 2001-2005 2011-2015 1981-1985
## [ reached getOption("max.print") -- omitted 3330878 entries ]
## attr(,"label")
## [1] Year of Diagnosis (5 year group)
## Levels: 2011-2015 2006-2010 2001-2005 1996-2000 1991-1995 1986-1990 1981-1985
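
A small base R sketch of reordering the levels chronologically. It relies on the labels starting with a four-digit year, so a plain sort() is enough; the label attribute is copied across explicitly in case rebuilding the factor drops it.

```r
# Levels as currently stored (newest first), values from the output above
year <- factor(
  c("1981-1985", "2001-2005", "1996-2000"),
  levels = c("2011-2015", "2006-2010", "2001-2005", "1996-2000",
             "1991-1995", "1986-1990", "1981-1985")
)
attr(year, "label") <- "Year of Diagnosis (5 year group)"

# Rebuild the factor with levels in chronological order
year_sorted <- factor(year, levels = sort(levels(year)))
attr(year_sorted, "label") <- attr(year, "label")

levels(year_sorted)
```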

Should count_fcds() complete missing groups?

I'm thinking it should at least have a complete argument. This would fill in missing groups.

I think it would be pretty typical to do

fcds %>%
  filter(...) %>%
  count_fcds() %>%
  age_adjust()

but this won't warn about missing groups and you won't see them until you plot the data and there are missing points/line segments.

On the other hand, it's not hard to do this

fcds %>%
  filter(...) %>%
  count_fcds() %>%
  complete_age_groups() %>%
  age_adjust()

and complete_age_groups() 1) has good defaults to make this easy and 2) uses the grouping of the input data at that step to also complete those groups.
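
For illustration, this is roughly what "completing" means for a single set of counts, in base R with toy data. It is only a sketch of the idea and not how complete_age_groups() is implemented.

```r
# Toy counts with the "5 - 9" group missing
counts <- data.frame(
  age_group = c("0 - 4", "10 - 14"),
  n = c(3L, 5L),
  stringsAsFactors = FALSE
)
all_groups <- data.frame(
  age_group = c("0 - 4", "5 - 9", "10 - 14"),
  stringsAsFactors = FALSE
)

# Left join onto the full set of groups, then fill missing counts with 0
completed <- merge(all_groups, counts, by = "age_group", all.x = TRUE)
completed$n[is.na(completed$n)] <- 0L
completed
```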

standardize_age_groups() does too much

standardize_age_groups() does a lot of work to try to resolve age groups. This is good for non-standard age groups, and handles cases like merging "5 - 7" and "8 - 9" into the standard "5 - 9".

But if the age groups already match those in fcds_const("age_group") exactly, this is far too much work and also results in scary "NAs introduced by coercion" warnings.
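
A sketch of a possible short-circuit: if every value is already a standard group, convert to a factor and skip the expensive resolution entirely. standardize_fast_path() is hypothetical, and std_groups is copied from the fcds_const() output shown in an earlier issue.

```r
std_groups <- c(
  "0 - 4", "5 - 9", "10 - 14", "15 - 19", "20 - 24", "25 - 29",
  "30 - 34", "35 - 39", "40 - 44", "45 - 49", "50 - 54", "55 - 59",
  "60 - 64", "65 - 69", "70 - 74", "75 - 79", "80 - 84", "85+", "Unknown"
)

# Hypothetical fast path: returns a factor when no work is needed and
# NULL to signal that the full resolution logic should run instead.
standardize_fast_path <- function(age_group, std_groups) {
  values <- as.character(age_group)
  if (all(values %in% std_groups)) {
    return(factor(values, levels = std_groups))
  }
  NULL
}

standardize_fast_path(c("5 - 9", "85+"), std_groups)
```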

year vs year_group

I think it might make more sense to store the grouped years (2011-2015) in year_group and the mid year in year.

  1. Functions that operate on grouped years already have the _year_groups() suffix, e.g. complete_year_groups().

  2. Simplifies joining with external data sources, i.e. by_year can be "year" by default.

Why are there duplicates in ICD-O-3 list?

library(rlang)
library(tidyverse)

fcds::seer_icd_o_3 %>%
  count(
    !!!syms(names(fcds::seer_icd_o_3)), 
    sort = TRUE
  ) %>%
  filter(n > 1)
#> # A tibble: 537 x 5
#>    histology histology_descrip… histology_behav… histology_behavior_…     n
#>    <fct>     <fct>              <fct>            <fct>                <int>
#>  1 982       LYMPHOID LEUKEMIA… 9823/3           Chronic lymphocytic…    82
#>  2 969       FOLLIC. & MARGINA… 9699/3           Marginal zone B-cel…    81
#>  3 800       NEOPLASM           8000/3           Neoplasm, malignant     77
#>  4 800       NEOPLASM           8001/3           Tumor cells, malign…    77
#>  5 800       NEOPLASM           8005/3           Malignant tumor, cl…    76
#>  6 800       NEOPLASM           8002/3           Malignant tumor, sm…    71
#>  7 800       NEOPLASM           8003/3           Malignant tumor, gi…    71
#>  8 800       NEOPLASM           8004/3           Malignant tumor, sp…    71
#>  9 968       ML, LARGE B-CELL,… 9680/3           ML, large B-cell, d…    67
#> 10 959       MALIGNANT LYMPHOM… 9590/3           Malignant lymphoma,…    66
#> # … with 527 more rows
Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.3 (2019-03-11)
#>  os       macOS Mojave 10.14.4        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2019-04-24                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                                 
#>  fcds          0.0.1.9008 2019-04-24 [1] local                                  
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

Make quotation more invisible

Currently, I use age_var to indicate the variable containing the age or age group in functions like expand_age_group(). Rather than

fcds %>%
  expand_age_group(age_var = my_age_group)

it would probably be better for the user to use age_group or age as appropriate.

fcds %>%
  expand_age_group(age_group = my_age_group)

How should users "hydrate" FCDS data?

The fcds package can't provide the actual FCDS data, but should include:

  1. instructions on how to access
  2. a function for running the import scripts that produces a "clean" FCDS data set
  3. a mechanism for minimizing friction for importing pre-processed FCDS data

Should we cache the imported data somewhere on the user's system? This would let us have a single "get started" workflow and make it possible for a user to consistently reference their imported data. On Linux/Mac this could easily be stored in ~/.fcds, or in an equivalent place on Windows.

I wonder, though, if it's advisable to create copies of restricted data? An alternative solution would be to store the path to the downloaded data using package settings (e.g. via pkgconfig, a global option set in .Rprofile, or an environment variable in .Renviron).
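
A sketch of how the lookup could work with a global option, then an environment variable, then a default directory. The function name, the option name fcds.data_dir and the variable name FCDS_DATA_DIR are all hypothetical.

```r
# Hypothetical lookup order: option, then environment variable, then ~/.fcds
fcds_data_dir_sketch <- function() {
  path <- getOption("fcds.data_dir", default = NA_character_)
  if (is.na(path)) {
    path <- Sys.getenv("FCDS_DATA_DIR", unset = NA_character_)
  }
  if (is.na(path)) {
    path <- file.path("~", ".fcds")
  }
  path.expand(path)
}

# e.g. in .Rprofile: options(fcds.data_dir = "/path/to/restricted/data")
fcds_data_dir_sketch()
```

Storing only the path keeps a single copy of the restricted data on disk while still letting every session find it.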
