moodymudskipper / cutr
Enhanced cut And Useful Related Functions
Could we have an easy way to group infrequent values?
what == "n_by_group" deals with some of these cases, but not all.
One of these cases is grouping all the small levels of an unordered factor together.
Another case is when we want to group everything after or before a hard threshold. The threshold should be given in absolute or relative terms.
For a character vector or an unordered factor, infrequent items would be grouped.
For ordered or numeric input, they would be grouped only when they follow each other. In practice it would mainly be used to aggregate the tails.
We could have new values of what:
"aggregate_bigger"
"aggregate_smaller"
"aggregate_rare"
We could consider that i between 0 and 1 (not included) means relative and i equal or superior to n is absolute. It's a bit awkward and can be dangerous, but it has the advantage of not multiplying parameters.
The other options are having more choices for what, but we already have a lot, or adding options.
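A minimal base R sketch of the "aggregate_rare" idea for character vectors and unordered factors. The helper `lump_rare()` and its `other` argument are hypothetical names, not part of cutr, and reading `i >= 1` as an absolute count is an assumption about how the relative/absolute rule could work.

```r
# Hypothetical helper, not part of cutr: lump infrequent values into one level.
# i in (0, 1) is read as a share of length(x); i >= 1 as an absolute count.
lump_rare <- function(x, i, other = "other") {
  stopifnot(is.numeric(i), length(i) == 1, i > 0)
  x <- as.character(x)
  counts <- table(x)
  threshold <- if (i < 1) i * length(x) else i
  rare <- names(counts)[counts < threshold]
  x[x %in% rare] <- other
  factor(x)
}

x <- c(rep("a", 6), rep("b", 3), "c")
table(lump_rare(x, 0.2))  # relative: values seen fewer than 2 times are lumped
table(lump_rare(x, 4))    # absolute: values seen fewer than 4 times are lumped
```

Overloading `i` this way keeps the interface small, at the cost of the ambiguity noted above around `i = 1`.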
probably when fewer than the expected number of groups were created
So far we take unique cutpoints with what = "group" but not for "breaks" or "quantiles"; this needs to be consistent. Also, empty categories are so far shown with [), but should be shown with ().
When labels is character, do as is done now. When it's a function/formula, execute it on the bin's content, so it could be a center function, but could also be ~ paste(., collapse = "+") for factors, for example.
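The idea of computing labels from each bin's content can be sketched in base R as follows. `label_bins()` is a hypothetical helper, not the cutr implementation. It accepts plain functions only (no formula shorthand), and it leaves the interval label on empty bins, which is an assumption.

```r
# Hypothetical helper: relabel each bin with label_fun applied to its content.
label_bins <- function(x, breaks, label_fun) {
  bins <- cut(x, breaks, include.lowest = TRUE)
  contents <- split(x, bins)  # one element per bin, empty bins included
  labs <- vapply(contents, function(v) {
    if (length(v)) as.character(label_fun(v)) else NA_character_
  }, character(1))
  new_levels <- levels(bins)
  # Empty bins keep their interval label.
  new_levels[!is.na(labs)] <- labs[!is.na(labs)]
  levels(bins) <- new_levels
  bins
}

x <- c(1, 2, 3, 10, 12)
label_bins(x, c(0, 5, 20), mean)                                  # a "center" label
label_bins(x, c(0, 5, 20), function(v) paste(v, collapse = "+"))  # content as label
```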
compare speed with other methods (base, Hmisc, ggplot), and between different values of what leading to similar groups.
It probably doesn't at the moment
If my data starts at 1, and closed = "right", and cuts are at 0, 1, and whatever, what happened to my first interval?
Maybe add an exception for this case. simplify will take care of it, else it is the same issue as the Inf situation in get_optimal_cuts.
In addition to cut3, have cutb and cut2b, or cutf and cut2f, or cut_format and cut_format2, which are just versions of cut and cut2 with a format_fun argument.
simplify could work:
cut3
They need names :
Implement this alternative: https://stackoverflow.com/questions/5731116/equal-frequency-discretization-in-r
for cut3, signif2, get_cuts (should get a new name, breaks?), format_metric, middle
Show with examples, maybe better to illustrate most with charts ?
Would be nice if we could use only charts, but still convey that it's not a graphical package.
design a nice small bimodal dataset, with a few outliers to show the expand, crop, squeeze features etc...
design a bigger continuous dataset with big numbers and many digits to show the format_metric and smart_signif features
Show example with factors
show how using formatC is like using cut, and using format is like using cut2
there might be some mistakes in the code, like taking x[1] instead of min(x)
what = "bins" and output = "numeric" / "label"
all the beginning of the function should be wrapped into smart_bin, then with the what = bins feature we'll be able to run both parts of the cut separately
functions like has_n_bins(n, call) / nth_label_is(ns, vals) / nth_value_is(ns, vals)
Handle the issue of empty bins, whether they are at the ends or in the middle. The levels should not be removed (users can use droplevels on the result); have a general approach.
This is a good starting case:
x <- c(rep(1,7),rep(2,5),3:6,17:20)
cuts <- c(-5,0,10, 15, 20, 25)
table(cut3(x, cuts))
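For comparison, here is how base `cut()` treats the same case. This only illustrates the "keep the levels, let users call droplevels" behaviour with base R and says nothing about what `cut3` currently returns.

```r
x <- c(rep(1, 7), rep(2, 5), 3:6, 17:20)
cuts <- c(-5, 0, 10, 15, 20, 25)

binned <- cut(x, cuts)       # base R, right-closed intervals by default
counts <- table(binned)      # empty bins are kept, with a count of 0
names(counts)[counts == 0]   # the empty bins sit at both ends and in the middle

nlevels(droplevels(binned))  # dropping them is left to the user
```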
get_cuts needs to include the transformations on i and what that are in part in main function atm.
need functions like get_cuts_from_breaks so the code can breathe more
optim_fun is specific to what = groups, so it shouldn't be there. It should be the optional second element of i, which would then be input as a list.
width_* options should become simply width, with an i argument that could be a list whose second element is a function/formula of .x = x and .y = width (the first element of i). The result of this function would give the first cutpoint, so compared to the current values the functions would be:
width (default): ~ min(.x) - .y/2 or shortcut "default" (necessary)
width_min: ~ min(.) or shortcut "min"
width_max: ~ max(.) or shortcut "max"
width_0: ~ 0
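The four variants above can be sketched with plain functions of `x` and `width` instead of formulas. `start_funs` and `width_breaks()` are hypothetical names. The sketch treats the function's result as an anchor that the break grid is aligned on rather than strictly the first cutpoint, an assumption made so that the "max" and "zero" variants still cover the data.

```r
# Hypothetical start functions: take x and width, return an anchor cutpoint.
start_funs <- list(
  default = function(x, width) min(x) - width / 2,
  min     = function(x, width) min(x),
  max     = function(x, width) max(x),
  zero    = function(x, width) 0
)

# Hypothetical helper: breaks spaced by `width`, aligned on the anchor,
# extended just far enough in both directions to cover range(x).
width_breaks <- function(x, width, start = "default") {
  anchor <- start_funs[[start]](x, width)
  k <- seq(floor((min(x) - anchor) / width), ceiling((max(x) - anchor) / width))
  anchor + k * width
}

x <- c(1, 2.5, 7)
width_breaks(x, 2)          # grid centred so that min(x) is mid-bin
width_breaks(x, 2, "min")   # grid starting at min(x)
width_breaks(x, 2, "max")   # grid ending at max(x)
width_breaks(x, 2, "zero")  # grid aligned on 0
```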
for completeness it would be nice, works better than naive quantile for many cases, and probably faster than cut3 optimization
Needs to be understood, isolated, and reversed to work for both values of closed
it's not well defined when cut points are given explicitly because then we also need the left/right args for when data falls on cutpoints.
crop is used for min and max already, maybe shrink? crop_all? squeeze? I think I like squeeze
same as quantile(x,c(...)) but slightly less verbose and more readable ?
Would deal with duplicate breaks
https://cran.r-project.org/web/packages/classInt/classInt.pdf
There are some things to take from there and some others not to take.
breaks
pretty is nice but it just creates breaks, so it can be used with i = pretty(x) with the default what. sd is apparently pretty used on centered/scaled data, so same thing.
Close this task when those issues are assigned as their own issues.
cut is slightly annoying to use in pipe chains:
df %>%
mutate(my_variable1 = cut(my_variable1, ...)) %>%
mutate(my_variable2 = cut(my_variable2, ...))
Or
df %>%
mutate_at(vars(my_variable1, my_variable2), cut, ... )
Instead we can have
df %>%
cutate(vars(my_variable1, my_variable2), ... )
a little more readable
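A rough base R sketch of what such a helper could do. This `cutate()` is hypothetical: it takes column names as a character vector instead of `vars()`, and it forwards `...` to base `cut()` rather than to a cutr function.

```r
# Hypothetical sketch of cutate(): apply cut() to several columns at once.
cutate <- function(df, cols, ...) {
  df[cols] <- lapply(df[cols], cut, ...)
  df
}

df <- data.frame(
  my_variable1 = c(1, 5, 9),
  my_variable2 = c(10, 50, 90),
  other = letters[1:3]
)
res <- cutate(df, c("my_variable1", "my_variable2"), breaks = 2)
str(res)  # the two selected columns are now factors, `other` is untouched
```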
through optim_fun
See alternatives : https://stackoverflow.com/questions/6104836/splitting-a-continuous-variable-into-equal-sized-groups
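One simple approach to equal-sized groups is rank-based. The sketch below is only an illustration, not what cutr does, and `equal_size_groups()` is a hypothetical name. Unlike `cut(x, quantile(x, ...))`, it does not fail when ties produce duplicate breaks, but tied values can land in different groups.

```r
# Hypothetical helper: assign each value to one of n groups of (near) equal size.
equal_size_groups <- function(x, n) {
  # ties.method = "first" gives distinct ranks 1..length(x),
  # so tied values may be split across neighbouring groups.
  ceiling(rank(x, ties.method = "first") * n / length(x))
}

x <- c(5, 1, 3, 2, 4, 6)
equal_size_groups(x, 3)
table(equal_size_groups(c(1, 1, 1, 1, 2, 2), 3))  # ties are split to keep sizes equal
```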
Review this function to use .binData instead of cut. The function used should have 2 parameters: bin sizes and cut points.
brackets = NULL is handy so let's make it work
the first 2 work, the latter 2 don't:
x <- c(rep(1,3),rep(2,2),3:6,17:20)
table(smart_cut(x,c(0,10,30),squeeze = TRUE))
#
# [1,6] [17,20]
# 9 4
table(smart_cut(x,c(0,1.5,30),squeeze = TRUE))
#
# 1 [2,20]
# 3 10
table(smart_cut(x,c(0,1,30),squeeze = TRUE))
#
# (1,NA) [1,20]
# 0 13
table(smart_cut(x,c(0,1.5,10,30),squeeze = TRUE))
#
# 1 [1.5,10] [17,20]
# 3 6 4
The squeeze feature will be nice for those, so let's code it first.
Unused level might not be an issue, but pay attention.
it's not clear and doesn't handle edge cases, also intro is unnecessary.
Simplify get_optimal_cutpoints by removing purrr stuff and getNamespace, show examples, then show how it would work with the package.
idea is good and might be handy for other packages, but doesn't play so well with interface, as we map conditionally and not many functions. Maybe just get rid of it and do simple tests + as_function. Will be easier to read
It doesn't use any recent R feature
If what = "membership", i must be a vector of the same length as x, with only one i value for each value of x, and i ordered by x must be sorted.
This will make it possible to use any clustering/grouping method unsupported by cutr, and still leverage the labelling features.
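The constraints on i can be sketched as a small validator that also builds range labels. `label_membership()` is a hypothetical helper, and the closed `[min,max]` label format is an assumption, not cutr's labelling.

```r
# Hypothetical helper: check a membership vector i against x and label the groups.
label_membership <- function(x, i) {
  stopifnot(length(i) == length(x))
  # Each distinct value of x must map to a single group.
  if (any(tapply(i, x, function(v) length(unique(v))) > 1)) {
    stop("each value of x must belong to exactly one group")
  }
  # Groups must be contiguous and increasing once x is sorted.
  if (is.unsorted(i[order(x)])) {
    stop("i must be sorted when ordered by x")
  }
  ranges <- vapply(
    split(x, i),
    function(v) sprintf("[%s,%s]", min(v), max(v)),
    character(1)
  )
  factor(unname(ranges[as.character(i)]), levels = unname(ranges), ordered = TRUE)
}

x <- c(1, 2, 3, 10, 11, 20)
i <- c(1, 1, 1, 2, 2, 3)  # e.g. produced by an external clustering method
label_membership(x, i)
```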
It says "keep open side open", that's not explicit enough. what it does is include last point on open side or not IF it falls on a cut point, in particular with the default expand = TRUE, extremities will always be closed :
smart_cut(1:6,list(2,0),"width",open_end = F)
smart_cut(1:6,list(2,0),"width",open_end = T)
We could have a parameter output = "ordered", that could also be "character" or "factor".