gerkelab / fcds

Process data from the Florida Cancer Data System

Home Page: https://gerkelab.github.io/fcds/

License: Other

R 99.53% CSS 0.47%

fcds's Issues

Rename age_low and age_high arguments

What this

fcds %>%
  filter_age_groups(age_low = 20, age_high = 50)

is really saying is

fcds %>%
  filter_age_groups(greater_than = 20, less_than = 50)

fcds %>%
  filter_age_groups(gt = 20, lt = 50)

fcds %>%
  filter_age_groups(age_gt = 20, age_lt = 50)

Function for recoding age groups

Should we add a function to make recoding ages easier? (See #45)

The syntax would be

fcds %>%
  recode_age(age_group, breaks = c(20, 50, 60, 85))

and would be equivalent to

fcds %>%
  mutate(
    age_group = case_when(
      age_high < 20 ~ "< 20",
      age_high < 50 ~ "20 - 49",
      age_high < 60 ~ "50 - 59",
      age_high < 85 ~ "60 - 84",
      TRUE ~ "85 +"
    ),
    age_group = fct_reorder(age_group, age_low)
  )
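One way such a helper could derive its labels from `breaks`, sketched in base R with cut(). recode_age_sketch() is a hypothetical name, and the sketch assumes the break points fall on boundaries of the existing five-year groups, so the lower bound of each group is enough to place it.

```r
# Hypothetical helper, not part of fcds: bin the lower bound of each age
# group at the break points, generating labels like "< 20", "20 - 49", "85 +".
recode_age_sketch <- function(age_low, breaks) {
  breaks <- sort(unique(breaks))
  labels <- c(
    paste("<", breaks[1]),
    if (length(breaks) > 1) paste(head(breaks, -1), "-", tail(breaks, -1) - 1),
    paste(tail(breaks, 1), "+")
  )
  # right = FALSE makes intervals closed on the left: [20, 50), [50, 60), ...
  cut(age_low, breaks = c(-Inf, breaks, Inf), labels = labels, right = FALSE)
}

recoded <- recode_age_sketch(
  c(0, 15, 20, 45, 55, 60, 85),
  breaks = c(20, 50, 60, 85)
)
levels(recoded)
```

Because cut() returns a factor whose levels follow the order of the breaks, a separate reordering step would not be needed in this version.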

Document, test, finalize names of existing functions

expand_age_groups()
- docs
- name
- tests
filter_age()
- docs
- name
- tests
complete_age_groups()
- docs
- name
- tests
standardize_age_groups()
- docs
- name
- tests
complete_year_groups()
- docs
- name
- tests
add_mid_year()
- docs
- name (change to add_year_mid() because appends _mid?)
- tests
get_data()
- remove? or re-purpose as fcds_data()?
age_adjust()
- docs
- name
- tests
merge_fl_pop()
- docs
  - should this be exported?
  - if exported, document minimum requirements for outside data
- name
- tests
age_adjust_finalize()
- docs (internal)
- name
- tests
summarize_fcds()
- docs
- name (add alias for summarise_fcds()?)
- tests
filter_fcds()
- docs
- name
- tests
fcds_const()
- docs
- name
- tests

Utilities

group_drop()
- docs
- name
- tests
with_ungroup()
- docs
- name
- tests
with_retain_groups()
- docs
- name
- tests

county_fips code needs to be three digit zero padded
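
A minimal sketch of the padding, assuming the codes arrive as integers or unpadded strings; pad_county_fips() is a hypothetical name, not an fcds function.

```r
# Hypothetical helper: left-pad county FIPS codes with zeros to three digits.
pad_county_fips <- function(x) {
  sprintf("%03d", as.integer(x))
}

padded <- pad_county_fips(c(1, "57", 125))
padded
```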

hispanic origin should match SEER data

hispanic -> origin
Values
- Non-Hispanic
- Hispanic
- Unknown

Add fcds_cache_ls() and fcds_cache_clean()

Might just be me, but it would be useful to be able to quickly list and clean outdated cache files.
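
A rough sketch of what the two helpers might do, in base R. The function names, the ~/.fcds default, and the ".rds" pattern are assumptions based on the cache path that fcds_load() prints in another issue here; the real cache layout may differ.

```r
# Hypothetical sketches, not the package's implementation.

# List cached .rds files, newest first by modification time.
cache_ls_sketch <- function(path = "~/.fcds") {
  files <- list.files(path, pattern = "\\.rds$", full.names = TRUE)
  files[order(file.mtime(files), decreasing = TRUE)]
}

# Remove all but the `keep` most recent cache files.
cache_clean_sketch <- function(path = "~/.fcds", keep = 1) {
  files <- cache_ls_sketch(path)
  outdated <- files[seq_along(files) > keep]
  file.remove(outdated)
  invisible(outdated)
}
```

Returning the removed paths invisibly lets a caller log what was deleted without cluttering the console.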

Make R CMD check happy

Mostly means fixing warnings where I've used variables in dplyr chains. Some documentation fixes required as well.
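
R CMD check usually reports bare column names in dplyr chains as "no visible binding for global variable". One common fix is the .data pronoun, sketched here with an illustrative function that is not part of fcds and assumes a column named age_low.

```r
# Illustrative only: referring to the column as `.data$age_low` instead of a
# bare `age_low` avoids the undefined-global report from R CMD check.
# Inside a package, `.data` itself is typically imported with
# `@importFrom rlang .data` in the roxygen block.
filter_adults_sketch <- function(data) {
  dplyr::filter(data, .data$age_low >= 20)
}

toy <- data.frame(
  age_group = c("15 - 19", "20 - 24"),
  age_low = c(15, 20)
)
adults <- filter_adults_sketch(toy)
adults
```

An alternative is declaring the names once with utils::globalVariables(), which silences the report but gives up the check for genuinely undefined variables.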

Add standard population with alternative age groups

Currently fcds::seer_std_ages uses the same age groups as the FCDS data, namely

> fcds_const("age_group")
 [1] "0 - 4"   "5 - 9"   "10 - 14" "15 - 19" "20 - 24"
 [6] "25 - 29" "30 - 34" "35 - 39" "40 - 44" "45 - 49"
[11] "50 - 54" "55 - 59" "60 - 64" "65 - 69" "70 - 74"
[16] "75 - 79" "80 - 84" "85+"     "Unknown"

The standard population is used in age_adjust(), which could be made more flexible by allowing other age groupings in the standard population.

We could have a standard_population() function (or similar) that makes this process easier, with the default behavior of returning fcds::seer_std_ages as currently formatted.

rename population data variables seer_pop_XX

e.g. change seer_fl_pop to seer_pop_fl because we may want to import other states or at least seer_us_pop.

Document and finalize fcds_const()

Cancer center catchment areas

Add internal data for catchment area (counties) of NCI-designated Cancer Centers in Florida.

Should catchment areas of other cancer-treatment locations also be added? Where would I find such a list?

Histology resources

From Oct 2018 Newsletter

WHO plans to release an ICD-O-3.2 in the future. Until then, we have multiple locations and resources that must be used to determine the best/correct histology including:

ICD-O-3 Manual – use your current printed manual
ICD-O-3 Errata & 2011 Updates
- http://www.who.int/classifications/icd/updates/icd03updates/en/
ICD-O-3 Updates for 2018
- https://seer.cancer.gov/icd-o-3/
2018 Solid Tumor MP/H Rules
- https://seer.cancer.gov/tools/solidtumor
Hematopoietic Database On Line
- https://seer.cancer.gov/seertools/hemelymph/
2018 Site-Specific Grade Instructions
- https://www.naaccr.org/SSDI/Grade-Manual.pdf
2018 SEER Site/Type Validation List
- https://seer.cancer.gov/icd-o-3/

merge_fl_pop() should be merge_population_data()

age_adjust() will be population-agnostic but with Florida (FCDS) defaults, and merge_fl_pop() should similarly be abstracted up one level as merge_population_data().

Vignette: Age Adjustment

From the half-finished age adjustment docs

Don't add columns in filter_age()

d_age_group doesn't have age_low or age_high, so these should not be in the output.

> d_age_group %>% fcds::filter_age(age_high = 15)
  id age_group age_low age_high
1  1     0 - 4       0        4
2  2     10-14      10       14

age_adjust() should choose seer_pop_fl_exp_race if origin in grouping

All values of origin are NA in seer_pop_fl because origin is "Not applicable in 1969-2011 W,B,O files"...

So age_adjust() should use seer_pop_fl_exp_race if origin is needed. Or the default should be changed.

Add data checking to loading step

Implement data checking when calling fcds_load().

empty fcds_vars() should list options

fcds::fcds_vars(NULL)
#> Error: `.x` is empty, and no `.init` supplied

Created on 2019-04-29 by the reprex package (v0.2.1)

Session info

devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.2 (2018-12-20)
#>  os       macOS High Sierra 10.13.6   
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2019-04-29                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                         
#>  assertthat    0.2.1      2019-03-21 [1] standard (@0.2.1)              
#>  backports     1.1.4      2019-04-10 [1] standard (@1.1.4)              
#>  callr         3.2.0      2019-03-15 [1] standard (@3.2.0)              
#>  cli           1.1.0      2019-03-19 [1] standard (@1.1.0)              
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                 
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.0)                 
#>  devtools      2.0.1      2018-10-26 [1] standard (@2.0.1)              
#>  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.0)                 
#>  dplyr         0.8.0.1    2019-02-15 [1] CRAN (R 3.5.2)                 
#>  evaluate      0.13       2019-02-12 [1] CRAN (R 3.5.2)                 
#>  fcds          0.0.5.9006 2019-04-29 [1] local                          
#>  fs            1.2.7      2019-03-19 [1] standard (@1.2.7)              
#>  glue          1.3.1      2019-03-12 [1] standard (@1.3.1)              
#>  highr         0.7        2018-06-09 [1] CRAN (R 3.5.0)                 
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)                 
#>  knitr         1.22       2019-03-08 [1] CRAN (R 3.5.2)                 
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)                 
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)                 
#>  pillar        1.3.1      2018-12-15 [1] CRAN (R 3.5.0)                 
#>  pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.0)                 
#>  pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.0)                 
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.0)                 
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)                 
#>  processx      3.3.0      2019-03-10 [1] CRAN (R 3.5.2)                 
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.0)                 
#>  purrr         0.3.2      2019-03-15 [1] standard (@0.3.2)              
#>  R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                 
#>  Rcpp          1.0.1      2019-03-17 [1] standard (@1.0.1)              
#>  remotes       2.0.4.9000 2019-04-23 [1] Github (r-lib/remotes@1f657ec) 
#>  rlang         0.3.4      2019-04-07 [1] standard (@0.3.4)              
#>  rmarkdown     1.11       2018-12-08 [1] CRAN (R 3.5.0)                 
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                 
#>  sessioninfo   1.1.1      2018-11-05 [1] standard (@1.1.1)              
#>  stringi       1.4.3      2019-03-12 [1] standard (@1.4.3)              
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.5.2)                 
#>  testthat      2.1.0.9000 2019-04-25 [1] Github (r-lib/testthat@8b8a481)
#>  tibble        2.1.1      2019-03-16 [1] standard (@2.1.1)              
#>  tidyr         0.8.3      2019-03-01 [1] CRAN (R 3.5.2)                 
#>  tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.0)                 
#>  usethis       1.5.0      2019-04-07 [1] CRAN (R 3.5.2)                 
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                 
#>  xfun          0.5.2      2019-03-14 [1] Github (yihui/xfun@d882a87)    
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.0)                 
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library
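
The error text in the fcds_vars(NULL) reprex looks like the one purrr::reduce() raises for an empty input without `.init`, so a guard ahead of that step could list the options instead. The sketch below is a hypothetical stand-in: vars_sketch(), the var_groups table, and the patient_id column are placeholders, not the package's real lookup.

```r
# Placeholder lookup table; only the "demo" group name and its three columns
# appear elsewhere in these issues, the rest is illustrative.
var_groups <- list(
  id = "patient_id",
  demo = c("age_group", "race", "sex")
)

vars_sketch <- function(...) {
  groups <- c(...)
  if (length(groups) == 0) {
    # Nothing requested: report the valid choices rather than failing.
    message("Available variable groups: ", paste(names(var_groups), collapse = ", "))
    return(invisible(names(var_groups)))
  }
  unique(unlist(var_groups[groups], use.names = FALSE))
}

vars_sketch(NULL)
vars_sketch("demo")
```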

Goal: Get core package code coverage to 80%

...and mark #nocov the things I don't plan to add tests for

complete_age_groups() should navigate around age_groups being in groups

library(tidyverse)
library(fcds)

fcds <- fcds_load()
#> Loading /Users/4468739/.fcds/fcds_2019-04-22-0844.rds
#> FCDS data checks are not yet implemented.

fcds %>% 
  sample_n(20) %>% 
  group_by(age_group) %>% 
  complete_age_groups()
#> Error: Column `age_group` can't be modified because it's a grouping variable

Created on 2019-04-22 by the reprex package (v0.2.1)
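
A possible workaround, sketched with dplyr and tidyr on toy data: remember the grouping, regroup without age_group while completing, then restore it. complete_ages_sketch() and its age_levels argument are hypothetical and do not reflect how complete_age_groups() is implemented.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: complete age groups even when age_group is a
# grouping variable of the input.
complete_ages_sketch <- function(data, age_levels) {
  groups <- group_vars(data)
  data %>%
    # regroup by everything except age_group so that column can be modified
    group_by(!!!rlang::syms(setdiff(groups, "age_group"))) %>%
    complete(age_group = age_levels, fill = list(n = 0L)) %>%
    # restore the caller's original grouping
    group_by(!!!rlang::syms(groups))
}

toy <- tibble(age_group = c("0 - 4", "10 - 14"), n = c(3L, 5L)) %>%
  group_by(age_group)

completed <- complete_ages_sketch(toy, c("0 - 4", "5 - 9", "10 - 14"))
completed
```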

Use factors in built-in population data

Now that standardize_age_groups() exists, run package data through this to avoid spurious "character-factor" joining issues.

Also reduce the storage size of the data by using factors where reasonable, in particular where it is clear that a column can have a limited number of values, e.g. county_name.

Future-proofing

The fcds package is geared toward processing the STAT_dataset_2018.dat data set, which is already outdated. We're in a good position to future-proof the package, but it may need a bit more work.

Where we are now:

- All of the processing steps can be specified in fcds_recoding.yaml, so future updates only require modifications to this file.
- fcds_recoding is exposed at the top level in fcds_import(), so updating to new specs won't require major API changes.

What we need:

- fcds_const() should take values from fcds_recoding.yaml (if applicable), rather than being hard-coded
- Documentation structure may need to be modified to make it easier to document new data specs in a consistent manner
- Add a helper function to choose the correct data spec file for the desired data set, either by matching against the input file name or allowing the user to specify
- Certain tests will probably break
- Should the example data be updated?

mid_year() should figure out mid year on its own

fcds/R/year.R

Lines 49 to 52 in c9f6638

mid_year <- function(years, sep = "-", offset = 2) {
  low_year_regex <- paste0("(\\d{2,4}).*", sep, ".*")
  paste(as.integer(sub(low_year_regex, "\\1", years)) + offset)
}

replace pattern with glue::glue("(\\d{{2,4}})\\s*{sep}\\s*(\\d{{2,4}})")
determine algorithm for finding mid year (floor((max - min) / 2)?)

implement stringr::str_match()-esque interface

regmatches("1994 - 2000", regexec(glue::glue("(\\d{{2,4}})\\s*{sep}\\s*(\\d{{2,4}})", sep = "-"), "1994 - 2000"))

remove offset or allow over-riding with default offset = NULL
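
Taken together, these points could look like the base R sketch below. mid_year_sketch() is a hypothetical name, the midpoint rule is the floor((max - min) / 2) candidate from the list above, and `sep` is assumed to contain no regex metacharacters.

```r
# Hypothetical rewrite: capture both years, then add either the computed
# half-width of the range or a user-supplied offset to the low year.
mid_year_sketch <- function(years, sep = "-", offset = NULL) {
  years <- as.character(years)
  pattern <- paste0("(\\d{2,4})\\s*", sep, "\\s*(\\d{2,4})")
  matches <- regmatches(years, regexec(pattern, years))
  vapply(matches, function(m) {
    # m holds the full match plus two capture groups; anything else is a miss
    if (length(m) < 3) return(NA_character_)
    low <- as.integer(m[2])
    high <- as.integer(m[3])
    if (is.null(offset)) offset <- floor((high - low) / 2)
    as.character(low + offset)
  }, character(1), USE.NAMES = FALSE)
}

mid_year_sketch(c("1981-1985", "1994 - 2000"))
mid_year_sketch("1994 - 2000", offset = 2)
```

Like the current function, this returns character values; inputs that do not look like a year range come back as NA rather than raising an error.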

Validation

Compare with rates and statistics at https://statecancerprofiles.cancer.gov/quick-profiles/index.php?statename=florida

Add ORCID IDs for authors

@tgerke do you have an ORCID id? Just drop it in this issue if you do.

5-year rate change plot

https://statecancerprofiles.cancer.gov/recenttrend/index.php?0&12&0&9599&001&999&00&0&0&0&1#results

inf rate estimates

library(fcds)

fcds <- fcds_load()

fcds %>% 
   dplyr::filter(cancer_site_group == "Prostate Gland") %>%
   count_fcds(sex = "Male") %>%
   age_adjust()

_min/_max vs _low/_high

Why are these different? 🤦‍♂

age_group -> age_low, age_high
year_group -> year_min, year_max

age_adjust() with keep_age should try harder?

If age groups don't match, then what?

library(tidyverse)
library(fcds)

fcds <- fcds_load()

# work with random subsample
fcds <- fcds %>% group_by(!!!rlang::syms(fcds_vars("demo"))) %>% sample_n(1) %>% ungroup()

If we do the regrouping first, age_adjust() will ultimately fail.

fcds_regrouped <- 
  fcds %>% 
  separate_age_groups() %>% 
  mutate(
    age_group = case_when(
      age_high < 20 ~ "< 20",
      age_high < 50 ~ "20 - 49",
      age_high < 60 ~ "50 - 59",
      age_high < 85 ~ "60 - 84",
      TRUE ~ "85 +"
    ),
    age_group = fct_reorder(age_group, age_low)
  )

fcds_vars(.data = fcds_regrouped, "demo")
#> # A tibble: 14,815 x 8
#>    age_group race  sex   origin marital_status birth_country birth_state
#>    <fct>     <fct> <fct> <fct>  <fct>          <fct>         <fct>      
#>  1 < 20      White Male  Non-H… Married; Unma… US States an… Florida    
#>  2 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  3 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  4 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  5 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  6 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  7 < 20      White Male  Non-H… Single, Separ… US States an… Florida    
#>  8 < 20      White Male  Non-H… Single, Separ… US States an… Other US S…
#>  9 < 20      White Male  Non-H… Single, Separ… US States an… Other US S…
#> 10 < 20      White Male  Non-H… Single, Separ… US States an… Other US S…
#> # … with 14,805 more rows, and 1 more variable: primary_payer <fct>

fcds_regrouped %>% 
  count_fcds() %>% 
  age_adjust(keep_age = TRUE)
#> The age groups in `data` do not match any age groups in
#> `population_standard`.

The current way around this is to do the re-grouping after the age adjustment.

fcds %>% 
  count_fcds() %>% 
  age_adjust(keep_age = TRUE) %>% 
  separate_age_groups() %>%
  group_drop(age_group) %>% 
  mutate(
    age_group = case_when(
      age_high < 20 ~ "< 20",
      age_high < 50 ~ "20 - 49",
      age_high < 60 ~ "50 - 59",
      age_high < 85 ~ "60 - 84",
      TRUE ~ "85 +"
    ),
    age_group = fct_reorder(age_group, age_low)
  ) %>% 
  group_by(age_group, add = TRUE) 
#> # A tibble: 126 x 9
#> # Groups:   year, year_mid, age_group [35]
#>    year  year_mid age_group age_low age_high     n population std_pop
#>    <fct> <chr>    <fct>       <dbl>    <dbl> <int>      <dbl>   <dbl>
#>  1 1981… 1983     < 20            0        4    20     672372  1.90e7
#>  2 1981… 1983     < 20            5        9    17     605665  1.99e7
#>  3 1981… 1983     < 20           10       14    17     712639  2.01e7
#>  4 1981… 1983     < 20           15       19    28     789181  1.98e7
#>  5 1981… 1983     20 - 49        20       24    48     890738  1.83e7
#>  6 1981… 1983     20 - 49        25       29    41     892078  1.77e7
#>  7 1981… 1983     20 - 49        30       34    41     793533  1.95e7
#>  8 1981… 1983     20 - 49        35       39    36     686575  2.22e7
#>  9 1981… 1983     20 - 49        40       44    46     581196  2.25e7
#> 10 1981… 1983     20 - 49        45       49    50     514395  1.98e7
#> # … with 116 more rows, and 1 more variable: w <dbl>

But the re-grouped ages overlap the underlying standard ages, so age_adjust() could have called standardize_age_groups() on the population data relative to the input data to do this for us.

Fix link to SEER*Stat age adjustment rates

Needs to be https://seer.cancer.gov/seerstat/tutorials/aarates/definition.html

Correct order of `year` factor

fcds %>% pull(year)
## [1] 1981-1985 1981-1985 2001-2005 2001-2005 1996-2000 1981-1985 1991-1995 1991-1995 1991-1995 1991-1995 1991-1995 1986-1990 1996-2000 2001-2005
## [15] 2001-2005 2001-2005 2001-2005 1996-2000 2001-2005 1981-1985 1996-2000 1996-2000 2001-2005 2011-2015 1981-1985
## [ reached getOption("max.print") -- omitted 3330878 entries ]
## attr(,"label")
## [1] Year of Diagnosis (5 year group)
## Levels: 2011-2015 2006-2010 2001-2005 1996-2000 1991-1995 1986-1990 1981-1985
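
A small sketch of one way to put the levels in chronological order, using a toy factor rather than the real data. It relies on each label starting with a four-digit year, so a plain sort gives the right order.

```r
# Toy factor with levels in the reversed order shown in the output above.
year <- factor(
  c("1981-1985", "2001-2005", "2011-2015"),
  levels = c("2011-2015", "2001-2005", "1981-1985")
)

# Re-level chronologically; the values themselves are unchanged.
year_fixed <- factor(year, levels = sort(levels(year)))
levels(year_fixed)
```

Rebuilding with factor() does not carry over extra attributes such as "label", so those would need to be re-attached afterwards.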

FIPS codes

Currently FIPS codes are taken from the documentation provided by FCDS: https://github.com/GerkeLab/fcds/blob/pkg/data-raw/20%20Appendix%20B%20FIPS%20County%20Codes%20for%20Florida.pdf

We may wish to also import Florida FIPS codes from census.gov and cross-check or validate FCDS data against these.

Fix character/factor joining warnings

In places where joins are used, e.g. join_population().

Should count_fcds() complete missing groups?

I'm thinking it should at least have a complete argument. This would fill in missing groups.

I think it would be pretty typical to do

fcds %>%
  filter(...) %>%
  count_fcds() %>%
  age_adjust()

but this won't warn about missing groups and you won't see them until you plot the data and there are missing points/line segments.

On the other hand, it's not hard to do this

fcds %>%
  filter(...) %>%
  count_fcds() %>%
  complete_age_groups() %>%
  age_adjust()

and complete_age_groups() 1) has good defaults that make this easy and 2) uses the grouping of the input data at that step to also complete those groups.

county_fips_fl FIPS column should probably be named "county_fips"

because that's the name in all other data sets

standardize_age_groups() does too much

standardize_age_groups() does a lot of work to try to resolve age groups. This is good for non-standard age groups, and handles cases like merging "5 - 7" and "8 - 9" into the standard "5 - 9".

But if the age groups perfectly match those in fcds_const("age_group"), this is far too much work and also results in scary "NAs introduced by coercion" warnings.

Vignette: Getting Started

get tibble from dplyr

don't include SEER population data for years that aren't in FCDS

Package data is getting a little bit large, affecting package load times, etc. For 5 year groups, we use the mid-year value for comparison with SEER populations. For example, we use the 1983 SEER population for the 5-year FCDS group "1981-1985".

Rename expand_age_groups() -> separate_age_groups()

I meant separate_ as an extension of tidyr::separate(), but I sometimes confuse tidyr::separate() and tidyr::expand() 🤦‍♂️

Add formula to age_adjust() documentation

The age-adjusted rate for an age group comprised of the ages x through y is calculated using the following formula:
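
A sketch of that formula, assuming the direct-standardization definition given in the SEER*Stat tutorial on age-adjusted rates. The symbol names are assumptions: count_i is the number of cases at age i, pop_i the matching population, and stdmil_i the standard population weight for age i.

```latex
\text{aarate}_{x\text{-}y}
  = \sum_{i=x}^{y} \left[
      \frac{\text{count}_i}{\text{pop}_i}
      \times 100{,}000
      \times \frac{\text{stdmil}_i}{\sum_{j=x}^{y} \text{stdmil}_j}
    \right]
```

Each age-specific rate per 100,000 is weighted by that age's share of the standard population within the range x through y, so the weights sum to 1.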

count_fcds()

API
Tests
Documentation

import seer all states population combined data as seer_pop_us

County-Level Population Files - Single-year Age Groups
https://seer.cancer.gov/popdata/download.html

year vs year_group

I think it might make more sense to store the grouped years (2011-2015) in year_group and the mid year in year.

- Functions that operate on grouped years already have the _year_groups() suffix, e.g. complete_year_groups().
- Simplifies joining with external data sources, e.g. by_year can be "year" by default.

Why are there duplicates in ICD-O-3 list?

library(rlang)
library(tidyverse)

fcds::seer_icd_o_3 %>%
  count(
    !!!syms(names(fcds::seer_icd_o_3)), 
    sort = TRUE
  ) %>%
  filter(n > 1)
#> # A tibble: 537 x 5
#>    histology histology_descrip… histology_behav… histology_behavior_…     n
#>    <fct>     <fct>              <fct>            <fct>                <int>
#>  1 982       LYMPHOID LEUKEMIA… 9823/3           Chronic lymphocytic…    82
#>  2 969       FOLLIC. & MARGINA… 9699/3           Marginal zone B-cel…    81
#>  3 800       NEOPLASM           8000/3           Neoplasm, malignant     77
#>  4 800       NEOPLASM           8001/3           Tumor cells, malign…    77
#>  5 800       NEOPLASM           8005/3           Malignant tumor, cl…    76
#>  6 800       NEOPLASM           8002/3           Malignant tumor, sm…    71
#>  7 800       NEOPLASM           8003/3           Malignant tumor, gi…    71
#>  8 800       NEOPLASM           8004/3           Malignant tumor, sp…    71
#>  9 968       ML, LARGE B-CELL,… 9680/3           ML, large B-cell, d…    67
#> 10 959       MALIGNANT LYMPHOM… 9590/3           Malignant lymphoma,…    66
#> # … with 527 more rows

Session info

devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.3 (2019-03-11)
#>  os       macOS Mojave 10.14.4        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2019-04-24                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                                 
#>  fcds          0.0.1.9008 2019-04-24 [1] local                                  
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

Make quotation more invisible

Currently, I use age_var to indicate the variable containing the age or age group in functions like expand_age_group(). Rather than

fcds %>%
  expand_age_group(age_var = my_age_group)

it would probably be better for the user to use age_group or age as appropriate.

fcds %>%
  expand_age_group(age_group = my_age_group)
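
A minimal sketch of how an argument named age_group could capture a bare column name, using rlang::enquo() inside a dplyr verb. age_low_sketch() is hypothetical and only extracts the lower bound; it is not the package's expand_age_group().

```r
library(dplyr)

# Hypothetical function: the argument is named for what it holds, and the
# user supplies whichever column contains it.
age_low_sketch <- function(data, age_group) {
  age_group <- rlang::enquo(age_group)
  mutate(
    data,
    # drop everything from the first non-digit onward, e.g. "10 - 14" -> "10"
    age_low = as.integer(sub("\\D.*$", "", as.character(!!age_group)))
  )
}

toy <- tibble(my_age_group = c("0 - 4", "10 - 14", "85+"))
result <- age_low_sketch(toy, age_group = my_age_group)
result
```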

How should users "hydrate" FCDS data?

The fcds package can't provide the actual FCDS data, but should include:

- instructions on how to access
- a function for running the import scripts that produces a "clean" FCDS data set
- a mechanism for minimizing friction for importing pre-processed FCDS data

Should we cache the imported data somewhere on the user's system? This would let us have a single "get started" workflow and make it possible for a user to consistently reference their imported data. On Linux/Mac this could easily be stored in ~/.fcds, or in an equivalent place on Windows.

I wonder, though, if it's advisable to create copies of restricted data? An alternative solution would be to store the path to the downloaded data using package settings (e.g. via pkgconfig, a global option set in .Rprofile, or an environment variable in .Renviron).
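
The path-based alternative might be as small as the sketch below. The option name "fcds.path" and the environment variable FCDS_PATH are illustrative choices, not settings the package defines.

```r
# Hypothetical lookup: prefer an R option, fall back to an environment
# variable, and fail with a hint when neither is set.
fcds_path_sketch <- function() {
  path <- getOption("fcds.path", default = Sys.getenv("FCDS_PATH", unset = NA))
  if (is.na(path) || !nzchar(path)) {
    stop("Set options(fcds.path = ...) or FCDS_PATH to the location of the FCDS data.")
  }
  path
}
```

A user would then put options(fcds.path = "...") in .Rprofile or FCDS_PATH=... in .Renviron once, and no copy of the restricted data needs to be made.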
