sfirke/janitor
simple tools for data cleaning in R
Home Page: http://sfirke.github.io/janitor/
License: Other
It's a useful data point for annotations and for working with the numbers by hand, e.g., during a check
j <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), .Names = c("104",
"114", "116", "120", "124", "126", "133", "144", "146", "149",
"156", "160", "163", "167", "173", "174", "177", "181", "186"
))
j %>% tabyl
104 n percent
1 TRUE 19 1
Warning message:
In names(result)[1] <- var_name :
number of items to replace is not a multiple of replacement length
Particularly nice when calling from top_levels
would be swell for SurveyMonkey -> SPSS -> R workflow, to know that q0038_0001
is the question you think it is.
Something like this at the start of the function:
# print label attribute, if it exists - does not work
if(!is.null(attr(dat %>% select(...), "label"))) {print(attr(dat %>% select(...), "label"))}
Maybe tabyl() - either the general function or one specialized for SPSS work - should take a labelled
class vector and work with that, rather than a data.frame.
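As a starting point, here's a minimal sketch of printing the label before tabulating, assuming the label is stored as a "label" attribute on the vector itself (the convention haven and foreign use). print_label() is a hypothetical helper, not part of janitor:

```r
# Sketch only: print an SPSS-style label, if present, before tabulating.
# Assumes the label lives in a "label" attribute on the vector (haven-style).
print_label <- function(vec) {
  lbl <- attr(vec, "label", exact = TRUE)
  if (!is.null(lbl)) message("Label: ", lbl)
  invisible(vec)  # pass the vector through unchanged, e.g. into tabyl()
}

x <- c(1, 2, 2)
attr(x, "label") <- "q0038_0001 label text"
print_label(x)
```

Because it returns its input invisibly, it could sit at the front of a pipe: print_label(survey$q0038_0001) %>% tabyl().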
This might be better off as a separate function. The idea would be that mtcars %>% tabyl(cyl, gear)
returns the result of:
full_join(tabyl(mtcars$cyl), tabyl(mtcars$gear), by = c("mtcars_cyl" = "mtcars_gear")) %>%
setNames(c("value", "cyl_n", "cyl_percent", "gear_n", "gear_percent"))
# A tibble: 5 x 5
value cyl_n cyl_percent gear_n gear_percent
<dbl> <int> <dbl> <int> <dbl>
1 4 11 0.34375 12 0.37500
2 6 7 0.21875 NA NA
3 8 14 0.43750 NA NA
4 3 NA NA 15 0.46875
5 5 NA NA 5 0.15625
Would it return n? %? Both? Have that be user-specified? If just n, it could take advantage of the adorn_crosstab
function I'm working on.
I have the variable name passed in via the ... parameter so that, when calling on a data.frame, the variable name doesn't need to be quoted and autocomplete works in RStudio; e.g., typing iris %>% tabyl(Sep will bring up autocomplete suggestions.
But this SO answer states:
It's not a good idea to use the ... when you know each parameter in advance, however, as it adds some ambiguity and further complication to the argument string (and makes the function signature unclear to any other user).
So maybe I'm using it unnecessarily?
with percentages row-wise, col-wise, total, or neither.
If it can extend the current tabyl without making it too complex, great.
Otherwise have it take two vectors instead of a data.frame?
The way table()
does? Seems like a useful option. Would be nice for survey analysis. Lower priority, though.
tabyl(mtcars$mpg)
Error: can't arrange by a matrix
crosstab()
Idea from @chrishaid
by creating another function. Here is a crude mockup:
crosstab_df <- function(dat, ...){
  # capture the bare column names supplied in ... and deparse them to strings
  names <- as.list(substitute(list(...)))[-1L]
  names <- unlist(lapply(names, deparse))
  # keep just those columns, then hand them to the vector-based crosstab()
  trimmed <- dat %>% select_(.dots = names)
  crosstab(trimmed[[1]], trimmed[[2]])
}
Which works: mtcars %>% crosstab_df(cyl, am)
(sort of, variable name is lost in the result)
Do this for tabyl() too, as tabyl_df(); right now it's tedious to have a dplyr pipeline with a filter, etc., that I have to interrupt to use tabyl() (or to use use_series(), which isn't even an option for crosstab()).
tabyl(survey$q0038_0001)
fails
So that all rows are visible when it prints to the console. As much as I love tibbles, for exploratory commands like this it's annoying not to see the bottom of the list.
it's called "new_vec" no matter what the input variable is called.
Ex:
top_2(as.factor(mtcars$wt)) %>% View
Right now it's middle ground - you can't call it, but you can pull its help page with ?clean_NA_vec. I think it should be exported, in case someone only desires to clean a single vector.
For instance tabyl(nchar(combined$`Associate's Year Awarded`))
yields:
nchar(combined_`Associate's Year Awarded`) n percent valid_percent
<int> <int> <dbl> <dbl>
1 1 1 8.212203e-05 0.01388889
2 3 9 7.390983e-04 0.12500000
3 4 27 2.217295e-03 0.37500000
4 5 29 2.381539e-03 0.40277778
5 9 1 8.212203e-05 0.01388889
6 11 2 1.642441e-04 0.02777778
IMO that should be combined_Associates_Year_Awarded for easier subsequent reference
here's my use case - I am writing functions to suck up n state export files.
I can't depend on read.csv's type hinting because I have situations where the nth data file will contain a data type (say, 'K' for grade, where grade had always been integer on the first 10 files) that doesn't play nicely.
To solve that, I'm reading the raw files in as character, and then doing type conversion myself.
But this seems like a janitor kind of job
Is this in scope/ out of scope? Any thoughts about how to move forward? @chrishaid would love your thoughts here as well
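The read-as-character-then-convert approach can be sketched with utils::type.convert(), which re-guesses each column's type once all files are combined, so a stray "K" in file 11 just leaves the whole column as character instead of corrupting the bind. convert_types() is a hypothetical name for illustration:

```r
# Sketch: re-guess column types after reading everything in as character.
# Wraps utils::type.convert(); as.is = TRUE keeps strings as character
# rather than converting them to factors.
convert_types <- function(df) {
  df[] <- lapply(df, function(col) {
    utils::type.convert(as.character(col), as.is = TRUE)
  })
  df
}

raw <- data.frame(grade = c("3", "4", "K"),
                  score = c("10", "12", "11"),
                  stringsAsFactors = FALSE)
clean <- convert_types(raw)
# grade stays character because of "K"; score becomes integer
```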
Like #33, but with tabyl().
mtcars %>% get_dupes(., wt, TRUE)
Error: incorrect size (1), expecting : 5
Is a terrible error message, albeit on a rare case.
More common is:
get_dupes(mtcars, wt, cyll):
Error: unknown column 'cyll'
But I'm relying on later functions to throw that.
right now it truncates for the middle category, which is most likely to need it. But add a function to truncate for top and bottom levels - maybe at a greater character limit.
For saying, use A if not NA, otherwise use B if not NA, otherwise... could just be a big case_when call (edit: case_when is probably not right given variable # of args, for loop may be simplest). Optionally create a new variable indicating which one was used.
Besides being more readable than nested ifelse, it would work for dates, which is vexing af if you run into this: http://stackoverflow.com/questions/6668963/how-to-prevent-ifelse-from-turning-date-objects-into-numeric-objects
In fact, have it print a warning if you feed it at least one input vector with class Date, suggesting that they specify force_class = "date".
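The for-loop version mentioned above can be sketched like this; it takes the first non-NA value across any number of vectors and, unlike nested ifelse(), preserves the Date class. first_non_na() is a hypothetical name:

```r
# Sketch of the loop approach: first non-NA value across several vectors,
# keeping the Date class that nested ifelse() would strip.
first_non_na <- function(...) {
  vecs <- list(...)
  out <- vecs[[1]]
  for (v in vecs[-1]) {
    fill <- is.na(out)        # positions still missing
    out[fill] <- v[fill]      # subscript assignment preserves class
  }
  out
}

a <- as.Date(c("2016-01-01", NA, NA))
b <- as.Date(c("2016-02-01", "2016-02-02", NA))
fallback <- as.Date(rep("2016-03-01", 3))
first_non_na(a, b, fallback)  # still class "Date", no NAs
```

The optional "which source was used" indicator could be built the same way, filling an integer vector inside the loop.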
Not useful for subsequent analysis but handy for examining, and sometimes for sharing. So something like specifying percent = "row", mode = "combined", and it would have in each cell a value like 10.4% (31).
Don't want to excessively complicate things, and it raises the issue of truncation preferences. Having parameters for percent, mode, and digits seems excessive.
I could have a separate function that handles all % related calculations, like:
crosstab(mtcars$cyl, mtcars$gear) %>% table_percentages(., percent = "row", show = "combined", digits = 1)
Though I don't like taking the simple percentage option out of crosstab(). But if I don't, there's redundancy.
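The cell-combining part of the separate function could be sketched as below: merge a percent table and a count table cell-by-cell into "10.4% (31)" strings. combine_pct_n() is a made-up name and assumes the two data.frames have identical dimensions:

```r
# Sketch: paste percent and count tables together cell-by-cell.
combine_pct_n <- function(pct_df, n_df, digits = 1) {
  out <- pct_df
  out[] <- Map(function(p, n) {
    sprintf("%s%% (%s)", round(100 * p, digits), n)
  }, pct_df, n_df)  # Map() iterates over the two tables column-by-column
  out
}

pct <- data.frame(gear_3 = c(0.104, 0.5))
cnt <- data.frame(gear_3 = c(31, 2))
combine_pct_n(pct, cnt)  # cells become "10.4% (31)", "50% (2)"
```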
Might just write this as a personal function. But I have to imagine it's widely used.
Again, mimicking Excel PivotTables.
I'm most of the way there, but unsure on:
So per that 2nd bullet, I think I need to switch when the totals col or row is displayed. And maybe stick with just displaying one by default, the pure 100%s one in Excel is not useful.
for easy knitting
crosstab(survey$q0002, survey$q0038_0001)
gives 1, 2, 3
get_dupes(mtcars, mpg)
gives an error if dplyr is not loaded:
Do I have to import magrittr's %>%? Or can/should I make it load dplyr - is that against CRAN principles?
x <- factor(c("aaaaaaaa", "bbbbbbbb", "ccccccccddddddddd", "dddddddd", NA, "hhhhhhhh", "bbbbbbbb"), levels = c("dddddddd", "aaaaaaaa", "ccccccccddddddddd", "bbbbbbbb", "hhhhhhhh"))
tabyl(x, sort = TRUE)
Then don't need to worry about topics besides agreement - "very confident" etc. Have it spit out top-2, bottom-2, and middle.
mtcars %>% crosstab(cyl, cyl)
throws error:
Error in `[.data.frame`(x, 2) : undefined columns selected
But you can do crosstab(mtcars$cyl, mtcars$cyl), and frankly, I don't think this is a meaningful operation. Use tabyl() instead. Not sure it's worth producing a more helpful error message.
Can we somehow disable feeding lists as arguments to crosstabs? Not a huge deal in practice, as it's not an intended use, but I'd rather it error than produce a nonsensical result. Look into it.
I'm not sure if this is acceptable or not. It gives you a legal variable name in the result, useful for further operations. But it's not as readable, and doesn't make for nice direct presentation, say in knitr::kable().
Only affects calls where there's an unrepresented level in a factor variable, so this is a niche case.
Instead of:
> tabyl(sorted_with_na_and_fac$grp, sort = TRUE)
sorted_with_na_and_fac$grp n percent valid_percent
1 c 2 0.50 0.6666667
2 a 1 0.25 0.3333333
3 b NA NA NA
4 <NA> 1 0.25 NA
Return:
> tabyl(sorted_with_na_and_fac$grp, sort = TRUE)
sorted_with_na_and_fac$grp n percent valid_percent
1 c 2 0.50 0.6666667
2 a 1 0.25 0.3333333
3 b 0 0.00 0.0000000
4 <NA> 1 0.25 0.0000000
Could likely be done with:
# replace NA values with zeroes
result[is.na(result)] <- 0
Though I think I'd want to retain the NA representation at row = NA and column = valid_percent, to show that missing values are not in fact 0% of valid responses once they've been filtered out.
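That refinement could be sketched as below: zero-fill the empty factor level, but leave valid_percent as NA in the <NA> row. This assumes the tabyl column names n, percent, and valid_percent; fill_empty_levels() is a hypothetical helper:

```r
# Sketch: zero-fill unrepresented levels, but keep valid_percent = NA
# for the <NA> row, where a valid share isn't meaningful.
fill_empty_levels <- function(result) {
  na_row <- is.na(result[[1]])   # the <NA> value row
  result$n[is.na(result$n)] <- 0
  result$percent[is.na(result$percent)] <- 0
  blank_level <- is.na(result$valid_percent) & !na_row
  result$valid_percent[blank_level] <- 0
  result
}

tab <- data.frame(grp = c("c", "a", "b", NA),
                  n = c(2, 1, NA, 1),
                  percent = c(0.5, 0.25, NA, 0.25),
                  valid_percent = c(2/3, 1/3, NA, NA),
                  stringsAsFactors = FALSE)
fill_empty_levels(tab)  # row "b" becomes zeroes; <NA> keeps NA valid_percent
```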
the variable name employee's name should become employees_name, not employee_s_name ... a small thing, but the latter is annoying to type
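A minimal sketch of the desired behavior: strip apostrophes first, then swap remaining illegal characters for underscores. This is standalone illustration, not janitor's actual clean_names() implementation:

```r
# Sketch: drop apostrophes before replacing other illegal characters,
# so "employee's name" -> employees_name rather than employee_s_name.
clean_name <- function(x) {
  x <- gsub("'", "", x)               # remove apostrophes entirely
  x <- gsub("[^A-Za-z0-9]+", "_", x)  # everything else illegal -> underscore
  tolower(x)
}

clean_name("employee's name")  # "employees_name"
```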
To give users more control. Maybe a parameter common = TRUE?
I wanted to do this:
mtcars %>% tabyl(mpg) %>% tabyl(n)
But got error message:
Error: found duplicated column name: n
Compare to:
mtcars %>% count(mpg) %>% count(n)
instead of top_2(factor_var), it should be top_levels(factor_var, lvls = 2)
Sam, one thing that I've run into recently is output of other packages (especially the stuff that reads in Access files... ugh) that doesn't give fine control to the read.csv parameters - I've been having to take the data frame as it comes, and then flip factors back to character.
Is that something that should live in janitor? If I made a pull request, would you be likely to accept?
Want to be able to call tabyl() on many columns at once. Compare lapply(mtcars, tabyl) to lapply(mtcars, table).
Ex: top_2(as.factor(mtcars$wt))
has enormously long 2nd row name for middle group because there are so many categories. Truncate at ~20 chars? And/or have it prefaced with "middle group:", or have it always be "Middle group (N categories)" where N is dynamic.
Takes a data.frame, turns all instances of "N/A" and "#N/A" and "NA" into true NA values. Maybe it takes a parameter to either clean those exact terms, or grep them, filtering out say "N/A- I have not yet used or received this support/tool."
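Such a cleaner might look like the sketch below: exact matches by default, with an optional regex mode for catching longer strings. Both the function name and the use_regex argument are hypothetical:

```r
# Sketch: turn "NA"-style strings in character columns into true NA values.
clean_na_strings <- function(dat, strings = c("NA", "N/A", "#N/A"),
                             use_regex = FALSE) {
  dat[] <- lapply(dat, function(col) {
    if (!is.character(col)) return(col)  # leave non-character columns alone
    hits <- if (use_regex) {
      grepl(paste(strings, collapse = "|"), col)
    } else {
      col %in% strings  # exact matching
    }
    col[hits] <- NA
    col
  })
  dat
}

df <- data.frame(x = c("N/A", "fine", "#N/A"), stringsAsFactors = FALSE)
clean_na_strings(df)  # x becomes NA, "fine", NA
```

With exact matching, a value like "N/A- I have not yet used or received this support/tool." is left alone, which is the filtering concern described above.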
Ran into this today, where we want to see if anyone w/ the same ID had specified different values for race columns. Used get_dupes, then looked at IDs that were not in the duplicated tables.
The use case is for cleaning data, when all records should have a duplicate. I'm not sure how to handle records where there's only one instance of the unique ID. Should it return all unique rows of the specified variables, and thus those? Then for this use case you'd have to start by filtering the table for records where the ID appears at least twice. Makes for a simpler function, but if you always have to pre-filter for it to be useful, maybe I should bake that in.
Let's start simple: takes a df and variable names, returns a df of the rows that didn't share those variable combinations. The opposite of get_dupes(), which is nice.
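The simple version could be sketched with dplyr: return rows whose combination of the given variables occurs exactly once. get_singletons() is a hypothetical name, not a committed interface:

```r
library(dplyr)

# Sketch: the mirror image of get_dupes() - rows whose combination of
# the supplied variables appears only once in the data.
get_singletons <- function(dat, ...) {
  dat %>%
    group_by(...) %>%     # group by the supplied bare column names
    filter(n() == 1) %>%  # keep groups of size one
    ungroup()
}

mtcars %>% get_singletons(wt)  # rows whose wt value is unique
```

For the cleaning use case above, you'd still pre-filter for IDs appearing at least twice; whether to bake that in is the open question.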
my call is yielding:
webinar Began Training Withdrew
1 Attended 72.2% (13) 27.8% ( 5)
2 Registered 25.0% ( 3) 75.0% ( 9)
3 <NA> 27.4% (29) 72.6% (77)
Changing paste_ns from n_matrix <- as.matrix(n_df) to n_matrix <- as.matrix(data.frame(lapply(n_df, as.character)))
gets me this:
webinar Began Training Withdrew
1 Attended 72.2% (13) 27.8% (5)
2 Registered 25.0% (3) 75.0% (9)
3 <NA> 27.4% (29) 72.6% (77)
Which I don't like since it's crooked. I want:
webinar Began Training Withdrew
1 Attended 72.2% (13) 27.8% (5)
2 Registered 25.0% (3) 75.0% (9)
3 <NA> 27.4% (29) 72.6% (77)
Basically, the spaces moved left to outside of the parentheses. Thinking I'll write a little regex replacement function to count the spaces after (, then replace the trailing spaces with preceding ones.
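That replacement could be a one-line gsub: capture the run of spaces just inside the "(" and move it to the front of the string, keeping the cell width constant. shift_paren_spaces() is a made-up name for the sketch:

```r
# Sketch: move padding spaces from inside "( 5)" to the front of the cell,
# so "27.8% ( 5)" becomes " 27.8% (5)" and widths still line up.
shift_paren_spaces <- function(x) {
  # \1 = everything before "(", \2 = the spaces, \3 = digits + ")"
  gsub("^(.*)\\(( +)(\\d+\\))$", "\\2\\1(\\3", x)
}

shift_paren_spaces("27.8% ( 5)")  # " 27.8% (5)"
shift_paren_spaces("72.2% (13)")  # unchanged - no spaces after "("
```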
What's important is that the input is the digits like 41883 - okay if it's stored as a character.
Suggested by @ffirke.
this really should exist - this paper links to dirty_iris, available on GitHub
That is somewhat dirty, but it doesn't have a lot of the 'in the wild' features that I see all the time (off the top of my head: data that is supposed to be numeric but has a NA / missing code that R interprets as character).
The tests that you have here are great but it'd be great to have examples against a nasty data file that looks similar to something that might be seen in the wild.
You can assign me this issue if you'd like.
Compare:
z_df <- crosstab(dat, v3, v1)
z <- crosstab(dat$v3, dat$v1)
Causing this test to fail:
test_that("crosstab.data.frame dispatches", {
expect_equal(z_df,
z %>% setNames(., c("v3", names(.)[-1]))) # compare to regular z above - they have different names[1] due to piping
})
Looks like it's due to needing stringsAsFactors = FALSE in crosstab.data.frame.
to create a table with top_2 agree values of different variables