sfirke/janitor
simple tools for data cleaning in R
Home Page: http://sfirke.github.io/janitor/
License: Other
It's a useful data point for annotations and for working with the numbers by hand, e.g., during a check
j <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), .Names = c("104",
"114", "116", "120", "124", "126", "133", "144", "146", "149",
"156", "160", "163", "167", "173", "174", "177", "181", "186"
))
j %>% tabyl
104 n percent
1 TRUE 19 1
Warning message:
In names(result)[1] <- var_name :
number of items to replace is not a multiple of replacement length
Particularly nice when calling from top_levels
would be swell for SurveyMonkey -> SPSS -> R workflow, to know that q0038_0001
is the question you think it is.
Something like this at the start of the function:
# print label attribute, if it exists - does not work
if(!is.null(attr(dat %>% select(...), "label"))) {print(attr(dat %>% select(...), "label"))}
Maybe tabyl() - either the general function or one specialized for SPSS work - should take a labelled
class vector and work with that, rather than a data.frame.
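As a starting point, here's a minimal sketch of printing the label before tabulating, assuming the label is stored as a "label" attribute on the vector itself (the convention haven and foreign use). print_label() is a hypothetical helper, not part of janitor:

```r
# Sketch only: print an SPSS-style label, if present, before tabulating.
# Assumes the label lives in a "label" attribute on the vector (haven-style).
print_label <- function(vec) {
  lbl <- attr(vec, "label", exact = TRUE)
  if (!is.null(lbl)) message("Label: ", lbl)
  invisible(vec)  # pass the vector through unchanged, e.g. into tabyl()
}

x <- c(1, 2, 2)
attr(x, "label") <- "q0038_0001 label text"
print_label(x)
```

Because it returns its input invisibly, it could sit at the front of a pipe: print_label(survey$q0038_0001) %>% tabyl().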
This might be better off as a separate function. The idea would be that mtcars %>% tabyl(cyl, gear)
returns the result of:
full_join(tabyl(mtcars$cyl), tabyl(mtcars$gear), by = c("mtcars_cyl" = "mtcars_gear")) %>%
setNames(c("value", "cyl_n", "cyl_percent", "gear_n", "gear_percent"))
# A tibble: 5 x 5
value cyl_n cyl_percent gear_n gear_percent
<dbl> <int> <dbl> <int> <dbl>
1 4 11 0.34375 12 0.37500
2 6 7 0.21875 NA NA
3 8 14 0.43750 NA NA
4 3 NA NA 15 0.46875
5 5 NA NA 5 0.15625
Would it return n? %? Both? Have that be user-specified? If just n, it could take advantage of the adorn_crosstab
function I'm working on.
I have the variable name passed in via the ... parameter so that, when calling on a data.frame, the variable name doesn't need to be quoted and autocomplete works in RStudio; e.g., typing iris %>% tabyl(Sep will bring up autocomplete suggestions.
But this SO answer states:
It's not a good idea to use the ... when you know each parameter in advance, however, as it adds some ambiguity and further complication to the argument string (and makes the function signature unclear to any other user).
So maybe I'm using it unnecessarily?
with percentages row-wise, col-wise, total, or neither.
If it can extend the current tabyl without making it too complex, great.
Otherwise have it take two vectors instead of a data.frame?
The way table()
does? Seems like a useful option. Would be nice for survey analysis. Lower priority, though.
tabyl(mtcars$mpg)
Error: can't arrange by a matrix
crosstab()
Idea from @chrishaid
by creating another function. Here is a crude mockup:
crosstab_df <- function(dat, ...){
  # capture the bare column names supplied in ... and deparse them to strings
  names <- as.list(substitute(list(...)))[-1L]
  names <- unlist(lapply(names, deparse))
  # keep just those columns, then hand them to the vector-based crosstab()
  trimmed <- dat %>% select_(.dots = names)
  crosstab(trimmed[[1]], trimmed[[2]])
}
Which works: mtcars %>% crosstab_df(cyl, am)
(sort of, variable name is lost in the result)
Do this for tabyl() too, as tabyl_df(); right now it's tedious to have a dplyr pipeline with a filter, etc., that I have to interrupt to use tabyl() (or to use use_series(), which isn't even an option for crosstab()).
tabyl(survey$q0038_0001)
fails
So that all rows are visible when it prints to the console. As much as I love tibbles, for exploratory commands like this it's annoying not to see the bottom of the list.
it's called "new_vec" no matter what the input variable is called.
Ex:
top_2(as.factor(mtcars$wt)) %>% View
Right now it's middle ground - you can't call it, but you can pull its help page with ?clean_NA_vec. I think it should be exported, in case someone only desires to clean a single vector.
For instance tabyl(nchar(combined$`Associate's Year Awarded`))
yields:
nchar(combined_`Associate's Year Awarded`) n percent valid_percent
<int> <int> <dbl> <dbl>
1 1 1 8.212203e-05 0.01388889
2 3 9 7.390983e-04 0.12500000
3 4 27 2.217295e-03 0.37500000
4 5 29 2.381539e-03 0.40277778
5 9 1 8.212203e-05 0.01388889
6 11 2 1.642441e-04 0.02777778
IMO that should be combined_Associates_Year_Awarded for easier subsequent reference
here's my use case - I am writing functions to suck up n state export files.
I can't depend on read.csv's type hinting because I have situations where the nth data file will contain a data type (say, 'K' for grade, where grade had always been integer on the first 10 files) that doesn't play nicely.
To solve that, I'm reading the raw files in as character, and then doing type conversion myself.
But this seems like a janitor kind of job
Is this in scope/ out of scope? Any thoughts about how to move forward? @chrishaid would love your thoughts here as well
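The read-as-character-then-convert approach can be sketched with utils::type.convert(), which re-guesses each column's type once all files are combined, so a stray "K" in file 11 just leaves the whole column as character instead of corrupting the bind. convert_types() is a hypothetical name for illustration:

```r
# Sketch: re-guess column types after reading everything in as character.
# Wraps utils::type.convert(); as.is = TRUE keeps strings as character
# rather than converting them to factors.
convert_types <- function(df) {
  df[] <- lapply(df, function(col) {
    utils::type.convert(as.character(col), as.is = TRUE)
  })
  df
}

raw <- data.frame(grade = c("3", "4", "K"),
                  score = c("10", "12", "11"),
                  stringsAsFactors = FALSE)
clean <- convert_types(raw)
# grade stays character because of "K"; score becomes integer
```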
Like #33, but with tabyl().
mtcars %>% get_dupes(., wt, TRUE)
Error: incorrect size (1), expecting : 5
Is a terrible error message, albeit on a rare case.
More common is:
get_dupes(mtcars, wt, cyll):
Error: unknown column 'cyll'
But I'm relying on later functions to throw that.
right now it truncates for the middle category, which is most likely to need it. But add a function to truncate for top and bottom levels - maybe at a greater character limit.
For saying, use A if not NA, otherwise use B if not NA, otherwise... could just be a big case_when call (edit: case_when is probably not right given variable # of args, for loop may be simplest). Optionally create a new variable indicating which one was used.
Besides being more readable than nested ifelse, it would work for dates, which is vexing af if you run into this: http://stackoverflow.com/questions/6668963/how-to-prevent-ifelse-from-turning-date-objects-into-numeric-objects
In fact, have it print a warning if you feed it at least one input vector with class Date, suggesting that they specify force_class = "date".
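The for-loop version mentioned above can be sketched like this; it takes the first non-NA value across any number of vectors and, unlike nested ifelse(), preserves the Date class. first_non_na() is a hypothetical name:

```r
# Sketch of the loop approach: first non-NA value across several vectors,
# keeping the Date class that nested ifelse() would strip.
first_non_na <- function(...) {
  vecs <- list(...)
  out <- vecs[[1]]
  for (v in vecs[-1]) {
    fill <- is.na(out)        # positions still missing
    out[fill] <- v[fill]      # subscript assignment preserves class
  }
  out
}

a <- as.Date(c("2016-01-01", NA, NA))
b <- as.Date(c("2016-02-01", "2016-02-02", NA))
fallback <- as.Date(rep("2016-03-01", 3))
first_non_na(a, b, fallback)  # still class "Date", no NAs
```

The optional "which source was used" indicator could be built the same way, filling an integer vector inside the loop.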
Not useful for subsequent analysis but handy for examining, and sometimes for sharing. So something like specifying percent = "row", mode = "combined", and it would have in each cell a value like 10.4% (31).
Don't want to excessively complicate things, and it raises the issue of truncation preferences. Having parameters for percent, mode, and digits seems excessive.
I could have a separate function that handles all % related calculations, like:
crosstab(mtcars$cyl, mtcars$gear) %>% table_percentages(., percent = "row", show = "combined", digits = 1)
Though I don't like taking the simple percentage option out of crosstab(). But if I don't, there's redundancy.
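The cell-combining part of the separate function could be sketched as below: merge a percent table and a count table cell-by-cell into "10.4% (31)" strings. combine_pct_n() is a made-up name and assumes the two data.frames have identical dimensions:

```r
# Sketch: paste percent and count tables together cell-by-cell.
combine_pct_n <- function(pct_df, n_df, digits = 1) {
  out <- pct_df
  out[] <- Map(function(p, n) {
    sprintf("%s%% (%s)", round(100 * p, digits), n)
  }, pct_df, n_df)  # Map() iterates over the two tables column-by-column
  out
}

pct <- data.frame(gear_3 = c(0.104, 0.5))
cnt <- data.frame(gear_3 = c(31, 2))
combine_pct_n(pct, cnt)  # cells become "10.4% (31)", "50% (2)"
```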
Might just write this as a personal function. But I have to imagine it's widely used.
Again, mimicking Excel PivotTables.
I'm most of the way there, but unsure on:
So per that 2nd bullet, I think I need to switch when the totals col or row is displayed. And maybe stick with just displaying one by default, the pure 100%s one in Excel is not useful.
for easy knitting
crosstab(survey$q0002, survey$q0038_0001)
gives 1, 2, 3
get_dupes(mtcars, mpg)
gives an error if dplyr is not loaded:
Do I have to import magrittr's %>%? Or can/should I make it load dplyr - is that against CRAN principles?
x <- factor(c("aaaaaaaa", "bbbbbbbb", "ccccccccddddddddd", "dddddddd", NA, "hhhhhhhh", "bbbbbbbb"), levels = c("dddddddd", "aaaaaaaa", "ccccccccddddddddd", "bbbbbbbb", "hhhhhhhh"))
tabyl(x, sort = TRUE)
Then don't need to worry about topics besides agreement - "very confident" etc. Have it spit out top-2, bottom-2, and middle.
mtcars %>% crosstab(cyl, cyl)
throws error:
Error in `[.data.frame`(x, 2) : undefined columns selected
But you can do crosstab(mtcars$cyl, mtcars$cyl), and frankly, I don't think this is a meaningful operation. Use tabyl() instead. Not sure it's worth producing a more helpful error message.
Can we somehow disable feeding lists as arguments to crosstabs? Not a huge deal in practice, as it's not an intended use, but I'd rather it error than produce a nonsensical result. Look into it.
I'm not sure if this is acceptable or not. It gives you a legal variable name in the result, useful for further operations. But it's not as readable, and doesn't make for nice direct presentation, say in knitr::kable().
Only affects calls where there's an unrepresented level in a factor variable, so this is a niche case.
Instead of:
> tabyl(sorted_with_na_and_fac$grp, sort = TRUE)
sorted_with_na_and_fac$grp n percent valid_percent
1 c 2 0.50 0.6666667
2 a 1 0.25 0.3333333
3 b NA NA NA
4 <NA> 1 0.25 NA
Return:
> tabyl(sorted_with_na_and_fac$grp, sort = TRUE)
sorted_with_na_and_fac$grp n percent valid_percent
1 c 2 0.50 0.6666667
2 a 1 0.25 0.3333333
3 b 0 0.00 0.0000000
4 <NA> 1 0.25 0.0000000
Could likely be done with:
# replace NA values with zeroes
result[is.na(result)] <- 0
Though I think I'd want to retain the NA representation at row = NA and column = valid_percent, to show that missing values are not in fact 0% of valid responses once they've been filtered out.
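That refinement could be sketched as below: zero-fill the empty factor level, but leave valid_percent as NA in the <NA> row. This assumes the tabyl column names n, percent, and valid_percent; fill_empty_levels() is a hypothetical helper:

```r
# Sketch: zero-fill unrepresented levels, but keep valid_percent = NA
# for the <NA> row, where a valid share isn't meaningful.
fill_empty_levels <- function(result) {
  na_row <- is.na(result[[1]])   # the <NA> value row
  result$n[is.na(result$n)] <- 0
  result$percent[is.na(result$percent)] <- 0
  blank_level <- is.na(result$valid_percent) & !na_row
  result$valid_percent[blank_level] <- 0
  result
}

tab <- data.frame(grp = c("c", "a", "b", NA),
                  n = c(2, 1, NA, 1),
                  percent = c(0.5, 0.25, NA, 0.25),
                  valid_percent = c(2/3, 1/3, NA, NA),
                  stringsAsFactors = FALSE)
fill_empty_levels(tab)  # row "b" becomes zeroes; <NA> keeps NA valid_percent
```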
the variable name employee's name should become employees_name, not employee_s_name ... a small thing, but the latter is annoying to type
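A minimal sketch of the desired behavior: strip apostrophes first, then swap remaining illegal characters for underscores. This is standalone illustration, not janitor's actual clean_names() implementation:

```r
# Sketch: drop apostrophes before replacing other illegal characters,
# so "employee's name" -> employees_name rather than employee_s_name.
clean_name <- function(x) {
  x <- gsub("'", "", x)               # remove apostrophes entirely
  x <- gsub("[^A-Za-z0-9]+", "_", x)  # everything else illegal -> underscore
  tolower(x)
}

clean_name("employee's name")  # "employees_name"
```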
To give users more control. Maybe a parameter common = TRUE?
I wanted to do this:
mtcars %>% tabyl(mpg) %>% tabyl(n)
But got error message:
Error: found duplicated column name: n
Compare to:
mtcars %>% count(mpg) %>% count(n)
instead of top_2(factor_var), it should be top_levels(factor_var, lvls = 2)
Sam, one thing that I've run into recently is output of other packages (especially the stuff that reads in Access files... ugh) that doesn't give fine control to the read.csv parameters - I've been having to take the data frame as it comes, and then flip factors back to character.
Is that something that should live in janitor? If I made a pull request, would you be likely to accept?
Want to be able to call tabyl() on many columns at once. Compare lapply(mtcars, tabyl) to lapply(mtcars, table).
Ex: top_2(as.factor(mtcars$wt))
has enormously long 2nd row name for middle group because there are so many categories. Truncate at ~20 chars? And/or have it prefaced with "middle group:", or have it always be "Middle group (N categories)" where N is dynamic.
Takes a data.frame, turns all instances of "N/A" and "#N/A" and "NA" into true NA values. Maybe it takes a parameter to either clean those exact terms, or grep them, filtering out say "N/A- I have not yet used or received this support/tool."
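Such a cleaner might look like the sketch below: exact matches by default, with an optional regex mode for catching longer strings. Both the function name and the use_regex argument are hypothetical:

```r
# Sketch: turn "NA"-style strings in character columns into true NA values.
clean_na_strings <- function(dat, strings = c("NA", "N/A", "#N/A"),
                             use_regex = FALSE) {
  dat[] <- lapply(dat, function(col) {
    if (!is.character(col)) return(col)  # leave non-character columns alone
    hits <- if (use_regex) {
      grepl(paste(strings, collapse = "|"), col)
    } else {
      col %in% strings  # exact matching
    }
    col[hits] <- NA
    col
  })
  dat
}

df <- data.frame(x = c("N/A", "fine", "#N/A"), stringsAsFactors = FALSE)
clean_na_strings(df)  # x becomes NA, "fine", NA
```

With exact matching, a value like "N/A- I have not yet used or received this support/tool." is left alone, which is the filtering concern described above.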
Ran into this today, where we want to see if anyone w/ the same ID had specified different values for race columns. Used get_dupes, then looked at IDs that were not in the duplicated tables.
The use case is for cleaning data, when all records should have a duplicate. I'm not sure how to handle records where there's only one instance of the unique ID. Should it return all unique rows of the specified variables, and thus those? Then for this use case you'd have to start by filtering the table for records where the ID appears at least twice. Makes for a simpler function, but if you always have to pre-filter for it to be useful, maybe I should bake that in.
Let's start simple: takes a df and variable names, returns a df of the rows that didn't share those variable combinations. The opposite of get_dupes(), which is nice.
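The simple version could be sketched with dplyr: return rows whose combination of the given variables occurs exactly once. get_singletons() is a hypothetical name, not a committed interface:

```r
library(dplyr)

# Sketch: the mirror image of get_dupes() - rows whose combination of
# the supplied variables appears only once in the data.
get_singletons <- function(dat, ...) {
  dat %>%
    group_by(...) %>%     # group by the supplied bare column names
    filter(n() == 1) %>%  # keep groups of size one
    ungroup()
}

mtcars %>% get_singletons(wt)  # rows whose wt value is unique
```

For the cleaning use case above, you'd still pre-filter for IDs appearing at least twice; whether to bake that in is the open question.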
my call is yielding:
webinar Began Training Withdrew
1 Attended 72.2% (13) 27.8% ( 5)
2 Registered 25.0% ( 3) 75.0% ( 9)
3 <NA> 27.4% (29) 72.6% (77)
Changing paste_ns from n_matrix <- as.matrix(n_df) to n_matrix <- as.matrix(data.frame(lapply(n_df, as.character)))
gets me this:
webinar Began Training Withdrew
1 Attended 72.2% (13) 27.8% (5)
2 Registered 25.0% (3) 75.0% (9)
3 <NA> 27.4% (29) 72.6% (77)
Which I don't like since it's crooked. I want:
webinar Began Training Withdrew
1 Attended 72.2% (13) 27.8% (5)
2 Registered 25.0% (3) 75.0% (9)
3 <NA> 27.4% (29) 72.6% (77)
Basically, the spaces moved left to outside of the parentheses. Thinking I'll write a little regex replacement function to count the spaces after (, then replace the trailing spaces with preceding ones.
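That replacement could be a one-line gsub: capture the run of spaces just inside the "(" and move it to the front of the string, keeping the cell width constant. shift_paren_spaces() is a made-up name for the sketch:

```r
# Sketch: move padding spaces from inside "( 5)" to the front of the cell,
# so "27.8% ( 5)" becomes " 27.8% (5)" and widths still line up.
shift_paren_spaces <- function(x) {
  # \1 = everything before "(", \2 = the spaces, \3 = digits + ")"
  gsub("^(.*)\\(( +)(\\d+\\))$", "\\2\\1(\\3", x)
}

shift_paren_spaces("27.8% ( 5)")  # " 27.8% (5)"
shift_paren_spaces("72.2% (13)")  # unchanged - no spaces after "("
```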
What's important is that the input is the digits like 41883 - okay if it's stored as a character.
Suggested by @ffirke.
this really should exist - this paper links to dirty_iris, available on GitHub
That is somewhat dirty, but it doesn't have a lot of the 'in the wild' features that I see all the time (off the top of my head: data that is supposed to be numeric but has a NA / missing code that R interprets as character).
The tests that you have here are great but it'd be great to have examples against a nasty data file that looks similar to something that might be seen in the wild.
You can assign me this issue if you'd like.
Compare:
z_df <- crosstab(dat, v3, v1)
z <- crosstab(dat$v3, dat$v1)
Causing this test to fail:
test_that("crosstab.data.frame dispatches", {
expect_equal(z_df,
z %>% setNames(., c("v3", names(.)[-1]))) # compare to regular z above - they have different names[1] due to piping
})
Looks like it's due to needing stringsAsFactors = FALSE in crosstab.data.frame.
to create a table with top_2 agree values of different variables