jonmcalder / refactor
:bookmark: Better factor handling for R
Home Page: https://jonmcalder.github.io/refactor/
License: MIT License
The following behaviour of cfactor is undesirable: cfactor(c("a", "b"), levels = c("a", "a", "b")) yields a warning that duplicated factor levels are deprecated. However, the duplicated levels are created anyway. This should not happen. Instead, before factor is called, we should remove duplicates from levels and issue a warning saying that duplicate levels were removed.
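The proposed fix could be sketched roughly as follows. This is only an illustration in base R: `dedupe_levels` is a hypothetical helper name, not a function in refactor, and the warning text is an assumption.

```r
# Hypothetical helper (not part of refactor): drop duplicated levels and
# warn about it before the levels are handed to factor().
dedupe_levels <- function(levels) {
  dup <- duplicated(levels)
  if (any(dup)) {
    warning(
      "Duplicate levels were removed: ",
      paste(unique(levels[dup]), collapse = ", "),
      call. = FALSE
    )
  }
  levels[!dup]
}

# Capture the warning text instead of printing it.
msg <- tryCatch(
  dedupe_levels(c("a", "a", "b")),
  warning = function(w) conditionMessage(w)
)

# suppressWarnings() only keeps this example quiet; a real caller would
# let the warning through.
f <- suppressWarnings(factor(c("a", "b"), levels = dedupe_levels(c("a", "a", "b"))))
levels(f)
```

Because the duplicates are removed before factor() sees them, the deprecated duplicated-levels warning is never triggered.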
x, labels and levels
It is actually possible to do the following without getting a warning: factor(letters, levels = letters, labels = sample(letters)). Here, we essentially map a value x_i to an arbitrary value y_i, where y_i is an element of X = (x_1, ..., x_n). In most situations, this is probably not what the user wanted. It should at least cause a warning.
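One way such a warning could be triggered is to check whether any label is itself one of the levels but is attached to a different level. The sketch below is an assumption about how the check might look; `warn_on_remapped_labels` is a hypothetical name and the rule itself is only one possible heuristic.

```r
# Hypothetical check (not part of refactor): warn when a label equals one
# of the levels but is attached to a different level.
warn_on_remapped_labels <- function(levels, labels) {
  levels <- as.character(levels)
  labels <- as.character(labels)
  # Only the one-label-per-level case is handled in this sketch.
  if (length(levels) != length(labels)) {
    return(invisible(FALSE))
  }
  remapped <- labels %in% levels & labels != levels
  if (any(remapped)) {
    warning(
      sum(remapped), " label(s) coincide with a different level of x",
      call. = FALSE
    )
  }
  invisible(any(remapped))
}

# rev(letters) is used instead of sample(letters) to keep the example
# deterministic; every letter is mapped to a different letter.
flagged <- suppressWarnings(warn_on_remapped_labels(letters, rev(letters)))
not_flagged <- warn_on_remapped_labels(c("a", "b"), c("low", "high"))
```

Labels that do not occur among the levels, such as "low" and "high" above, pass silently, so ordinary relabelling is unaffected.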
Since quantile is used internally, cut.ordered cannot yet account for the right argument that exists in cut.default and cut.integer.
Should we create a function as.cfactor that just calls as.factor for consistency and completeness?
If length(breaks) == 1, the option right remains without effect, whereas in the case of length(breaks) > 1, the option balance remains without effect, but no message is issued.
Since both options refer to more or less the same thing (if not exactly the same), we should keep only one.
cut.integer with breaks of c(1, 3.4, 6, 7, 10) creates levels 1-3.4 4.4-6 7-10.
breaks need to be coerced to integers, with a warning if any decimal numbers were rounded, and the resulting breaks should be displayed.
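A minimal sketch of that coercion in base R is shown below. `coerce_breaks` is a hypothetical helper name, and the choice of round() (rather than floor() or ceiling()) is an assumption; the issue does not say which rounding rule is wanted.

```r
# Hypothetical helper (not part of refactor): round non-integer breaks,
# warn if anything changed, and report the breaks actually used.
coerce_breaks <- function(breaks) {
  rounded <- round(breaks)
  if (any(rounded != breaks)) {
    warning(
      "Non-integer breaks were rounded; using breaks: ",
      paste(rounded, collapse = ", "),
      call. = FALSE
    )
  }
  as.integer(rounded)
}

# Capture the warning text to see which breaks are reported.
msg <- tryCatch(
  coerce_breaks(c(1, 3.4, 6, 7, 10)),
  warning = function(w) conditionMessage(w)
)
new_breaks <- suppressWarnings(coerce_breaks(c(1, 3.4, 6, 7, 10)))
```

Rounding can produce duplicated breaks (for example 3.4 and 3 both become 3), which this sketch does not handle.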
The breaks_mode is designed to resemble cut.default; however, the right argument remains without effect in cut.integer for the default breaks_mode and scalar breaks, as shown below. First, look at cut.default:
cut.default(1:4, 3, right = T)
[1] (0.997,2] (0.997,2] (2,3] (3,4]
Levels: (0.997,2] (2,3] (3,4]
cut.default(1:4, 3, right = F)
[1] [0.997,2) [2,3) [3,4) [3,4)
Levels: [0.997,2) [2,3) [3,4)
The second element of the output falls in the first bin with right = TRUE, but in the second bin with right = FALSE. As you stated in the documentation, the implementation of cut.integer does not consider the right argument in this case.
cut(1:4, 3, right = F)
[1] 1-2 1-2 3 4
Levels: 1-2 3 4
cut(1:4, 3, right = T)
[1] 1-2 1-2 3 4
Levels: 1-2 3 4
Since we name the breaks_mode "default", cut.integer needs to reproduce the underlying output (not the labels, obviously) of cut.default.
cut(sample(2), breaks = 3) and cut(1L, breaks = 2)
cut.default which might not be immediately clear to the user.
It should be possible to quickly see which level is mapped to which integer and what labels were used when it was created. This could be achieved in two steps:
Depending on how breaks are specified, the lowest values are included or not. cut(sample(10), breaks = 2) gives the levels 1-5 and 6-10 (because the default is overwritten internally on line 78), while cut(sample(10), breaks = c(1, 5, 10)) gives the levels 2-5 and 6-10.
Probably due to the default in cut.default, you wanted to keep include.lowest = FALSE. I think it is a counterintuitive default, so let's do away with it, change it to TRUE, and hence remove line 78.
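The effect of include.lowest can be seen in base R's own cut() without loading refactor. The snippet below only illustrates the base behaviour that the issue refers to; it does not show refactor's cut.integer.

```r
x <- 1:10

# With the base default include.lowest = FALSE, the lowest break is
# excluded from the first bin, so the value 1 falls outside every bin.
open_low <- cut(x, breaks = c(1, 5, 10))

# With include.lowest = TRUE, the first bin is closed on the left and
# every value of x gets a level.
closed_low <- cut(x, breaks = c(1, 5, 10), include.lowest = TRUE)

is.na(open_low[1])
anyNA(closed_low)
table(closed_low)
```

This is why explicit breakpoints silently drop the minimum unless include.lowest is switched on.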
When breaks are specified too closely together, bins of width one with ugly labels result, e.g. from cut(sample(10), breaks = c(1, 4, 6, 8, 9, 10)) we get levels 1-4 5-6 7-8 9-9 10-10
10) instead of creating a standard label (e.g. 10-10).
Came across this on StackOverflow and it offers around a 5-6x speedup on cut.default, so I figure why not make use of it? (or something similar)
Not sure whether it makes more sense to display where the error / warning occurred or not. I guess in interactive usage it is pointless, but if you wrap, for example, cfactor in another function, it might be helpful?
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-10-16
library(refactor); cut(c(sample(10), NA, NA), breaks = 3)
#>
#> Attaching package: 'refactor'
#> The following objects are masked from 'package:base':
#>
#> append, ifelse
#> Error in if (breaks > max(x) - min(x) + 1) {: missing value where TRUE/FALSE needed
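The error arises because max(x) and min(x) return NA when x contains missing values, so the condition inside if() is NA. A minimal sketch of the comparison with and without na.rm = TRUE follows; the variable names are illustrative, and whether refactor should drop NAs here or handle them differently is left open.

```r
x <- c(1:10, NA, NA)
breaks <- 3

# Without na.rm, the range of x is NA, and `if (breaks > span_with_na)`
# would fail with "missing value where TRUE/FALSE needed".
span_with_na <- max(x) - min(x) + 1

# Ignoring missing values gives a usable range for the comparison.
span <- max(x, na.rm = TRUE) - min(x, na.rm = TRUE) + 1
too_many_breaks <- breaks > span

is.na(span_with_na)
too_many_breaks
```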
The solution for bins of width one only works when breaks are specified as breakpoints.
Currently, the two cut methods for integer and ordered inherit from the default method regarding whether or not the output should be an ordered factor. In the help file, it is stated that
Logical: should the result be an ordered factor?
This seems to be strange behaviour since, by nature, we are dealing with a sequence of ordered values; otherwise cut could not have been applied.
Hence, I suggest changing the default to TRUE and documenting this in the help file under Details.
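For reference, base R's cut() already exposes this choice through its ordered_result argument, which defaults to FALSE. The snippet below shows the difference in base R only; it says nothing about how refactor's methods are implemented.

```r
x <- 1:10

# Base default: an unordered factor, so bins cannot be compared with <.
unordered_bins <- cut(x, breaks = 3)

# ordered_result = TRUE returns an ordered factor whose levels follow the
# order of the bins.
ordered_bins <- cut(x, breaks = 3, ordered_result = TRUE)

is.ordered(unordered_bins)
is.ordered(ordered_bins)
ordered_bins[1] < ordered_bins[10]
```

With an ordered result, comparisons such as "is this observation in a lower bin than that one" work directly.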
Because of the assertion in cut.integer for breaks, something like cut(sample(10), breaks = c(1L, 3L, 10L)) fails.
Use assert(check_class(breaks, "integer"), check_class(breaks, "numeric"))
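The failure is consistent with how R reports classes: an integer vector has class "integer", not "numeric", so a strict class check for "numeric" rejects it. The base R sketch below illustrates this. It deliberately avoids the assert() and check_class() calls from the suggestion, which appear to come from the checkmate package, and uses is.numeric() as one possible alternative.

```r
int_breaks <- c(1L, 3L, 10L)
dbl_breaks <- c(1, 3, 10)

# A class-based test distinguishes the two storage types.
class(int_breaks)
class(dbl_breaks)
inherits(int_breaks, "numeric")

# is.numeric() accepts both integer and double vectors, which matches the
# intent of allowing either class for breaks.
is.numeric(int_breaks)
is.numeric(dbl_breaks)
```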