moodymudskipper / cutr


Enhanced cut And Useful Related Functions

R 100.00%

cutr's Issues

group infrequent values?

Could we have an easy way to group infrequent values?

what == "n_by_group" deals with some of these cases, but not all.

One of these cases is to group all small unordered factors together.

Another case is when we want to group everything after or before a hard threshold. The threshold should be given in absolute or relative terms.

For character vectors or unordered factors, infrequent items would be grouped.

In case of ordered or numeric they would be grouped when they follow each other. In practice it would be mainly used to aggregate the tails.

We could have new values of what :

"aggregate_bigger"
"aggregate_smaller"
"aggregate_rare"

We could treat i strictly between 0 and 1 as a relative threshold and i equal to or greater than 1 as an absolute one. This is a bit awkward and can be dangerous, but it has the advantage of not multiplying parameters.

The other options are having more choices but we already have a lot, or adding options.
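A minimal base R sketch of the "aggregate_rare" idea for characters and unordered factors. The helper name `lump_rare`, its arguments, and the rule "i below 1 is relative, i of 1 or more is absolute" are assumptions made for illustration, not existing cutr code.

```r
# Hypothetical helper, not part of cutr: lump levels whose frequency is below
# a threshold `i` into a single `other` level.
# Assumption: 0 < i < 1 is a relative frequency, i >= 1 is an absolute count.
lump_rare <- function(x, i, other = "other") {
  x <- as.character(x)
  counts <- table(x)
  threshold <- if (i < 1) i * length(x) else i
  rare <- names(counts)[counts < threshold]
  x[x %in% rare] <- other
  factor(x)
}

x <- c(rep("a", 6), rep("b", 3), "c", "d")
table(lump_rare(x, 2))    # absolute: levels seen fewer than 2 times are lumped
table(lump_rare(x, 0.3))  # relative: levels below 30% of observations are lumped
```

Overloading a single argument keeps the interface small, at the cost of the ambiguity around values close to 1 mentioned above.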

cut3 with "groups" warns about duplicate levels

probably when less than the expected number of groups were created

be clearer about duplicate cut points

So far we take unique cut points with what = "group" but not for "breaks" or "quantiles"; this needs to be consistent. Also, empty categories are currently shown with [) and should be shown with ().
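For reference, this is how base `cut` reacts to duplicate cut points coming from quantiles on tied data, and what taking unique cut points changes. It only uses base R and says nothing about how cutr handles the same input.

```r
x <- c(1, 1, 1, 1, 2, 3, 4, 5)
breaks <- quantile(x, probs = seq(0, 1, 0.25), names = FALSE)
breaks                      # the first two quartile cut points coincide

# base cut refuses duplicated breaks
res <- tryCatch(cut(x, breaks), error = function(e) e)
inherits(res, "error")

# de-duplicating works, but yields fewer bins than the 4 requested
binned <- cut(x, unique(breaks), include.lowest = TRUE)
table(binned)
```

Whatever rule is chosen, the number of bins actually produced can be lower than the number requested, which is worth stating in the documentation.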

center_fun to be remapped to 'labels'

when labels is character do as is now, when it's a function/ formula, execute it on the bin's content, so could be a center function, but could also be ~ paste(., collapse = "+") for factors for example.
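A rough sketch of labels computed from each bin's content, using base `cut` and plain functions instead of formula notation. `relabel_bins` is a hypothetical helper and it does not handle empty bins.

```r
x <- c(1, 2, 2, 3, 10, 11, 12, 30)
bins <- cut(x, breaks = c(0, 5, 20, 40))

# Hypothetical helper: replace each level by `fun` applied to the values
# falling in that bin. Assumes every bin contains at least one value.
relabel_bins <- function(x, bins, fun) {
  new_labels <- vapply(split(x, bins),
                       function(v) as.character(fun(v)),
                       character(1))
  levels(bins) <- unname(new_labels)
  bins
}

relabel_bins(x, bins, median)   # a "center" function
relabel_bins(x, bins, function(v) paste(range(v), collapse = "-"))
```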

benchmarks

compare speed with other methods (base, Hmisc, ggplot), and between different values of what leading to similar groups.
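A possible starting point for the benchmark, restricted to base R (`cut` versus `findInterval`) so that it runs without extra packages. Timings depend on the machine, so none are quoted here; the equality check only confirms both calls produce the same bin codes on this data.

```r
set.seed(1)
x <- runif(1e5, 0, 100)
breaks <- seq(0, 100, by = 10)

t_cut <- system.time(
  res_cut <- cut(x, breaks, labels = FALSE, include.lowest = TRUE)
)
t_fi <- system.time(
  res_fi <- findInterval(x, breaks, rightmost.closed = TRUE, left.open = TRUE)
)

# both approaches should assign the same integer bin to every value
all(res_cut == res_fi)
rbind(cut = t_cut[1:3], findInterval = t_fi[1:3])
```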

check that squeeze gets along fine with crop /expand

It probably doesn't at the moment

Check that crop doesn't mess up number of intervals

If my data starts at 1, closed = "right", and the cuts are 0, 1, and whatever else, what happened to my first interval?

maybe add an exception for this case. simplify will take care of it, else same issue as the Inf situation in get_optimal_cuts

cutb and cut2b

In addition to cut3, have cutb and cut2b, or cutf and cut2f, or cut_format and cut_format2, which are just versions of cut and cut2 with a format_fun argument.

more flexible simplify argument

simplify could work:

only for values that fall on cut points (like it does in cut2, despite what the doc says)
on any point (like it currently does in cut3)
on any point AND crop neighbor intervals (could be the default)

They need names :

only_breaks
any
any_and_crop
NULL

hard cut

implement this alternative : https://stackoverflow.com/questions/5731116/equal-frequency-discretization-in-r
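One possible reading of a "hard cut" is a rank-based split that forces groups of (nearly) equal size even when values are tied. The sketch below is written under that assumption; `hard_cut` is a hypothetical name and this is not necessarily the approach from the linked question.

```r
# Hypothetical helper: assign each observation to one of `n` groups of
# (nearly) equal size, based on ranks; ties are split by order of appearance.
hard_cut <- function(x, n) {
  r <- rank(x, ties.method = "first")
  ceiling(r * n / length(x))
}

x <- c(1, 1, 1, 1, 2, 3, 4, 5)
hard_cut(x, 4)
table(hard_cut(x, 4))   # identical values can land in different groups
```

The trade-off is that groups are balanced but no longer correspond to intervals with well-defined cut points.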

Function documentation

for cut3, signif2, get_cuts (should get a new name, breaks ?), format_metric, middle

Readme / Vignette

Show with examples, maybe better to illustrate most with charts ?

Would be nice if we could use only charts, but still convey that it's not a graphical package.

design a nice small bimodal dataset, with a few outliers to show the expand, crop, squeeze features etc...
design a bigger continuous dataset with big numbers and many digits to show the format_metric and smart_signif features
Show example with factors
show how using formatC is like using cut, and using format is like using cut2

factor feature should support breaks given as character/factor

more tests for unsorted x!

there might be some mistakes in the code, like taking x[1] instead of min(x)
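A small invariance check of the kind these tests could use, shown here on base `cut` so it runs as is: binning an unsorted vector and then reordering should match binning the sorted vector.

```r
set.seed(42)
x <- sample(1:20)             # unsorted input
breaks <- c(0, 5, 10, 20)

binned_unsorted <- cut(x, breaks)
binned_sorted <- cut(sort(x), breaks)

# binning then reordering should equal reordering then binning
stopifnot(identical(binned_unsorted[order(x)], binned_sorted))

# the kind of mistake to look for: x[1] is not min(x) on unsorted data
c(first = x[1], min = min(x))
```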

weighted cuts ?

https://stackoverflow.com/questions/12512736/r-how-to-bin-weighted-data
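A minimal sketch of weighted cut points, assuming the goal is bins holding roughly equal total weight. `weighted_breaks` is a hypothetical helper using a simple cumulative-weight rule; it is not taken from the linked question.

```r
# Hypothetical helper: cut points such that each of the `n` bins holds
# roughly the same share of the total weight.
weighted_breaks <- function(x, w, n) {
  o <- order(x)
  cum_w <- cumsum(w[o]) / sum(w)
  probs <- seq_len(n - 1) / n
  # smallest value whose cumulative weight share reaches each probability
  inner <- vapply(probs, function(p) x[o][which(cum_w >= p)[1]], numeric(1))
  unique(c(min(x), inner, max(x)))
}

x <- c(1, 2, 3, 4, 5)
w <- c(1, 1, 1, 1, 4)
breaks <- weighted_breaks(x, w, 2)
breaks
tapply(w, cut(x, breaks, include.lowest = TRUE), sum)  # weight per bin
```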

need to document new output and what features

what = "bins" and output = "numeric" / "label"

feature smart_bin

all the beginning of the function should be wrapped into smart_bin, then with what = bins feature we'll be able to run both parts of the cut separately

Design test functions

functions like has_n_bins(n, call) / nth_label_is(ns, vals) / nth_value_is(ns, vals)
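A possible shape for such helpers, sketched on base `cut`. The signatures here take the binned result as a last argument, which is an assumption and differs from the `call` argument suggested above.

```r
# Hypothetical test helpers operating on a factor returned by a cut function.
has_n_bins <- function(n, x) nlevels(x) == n
nth_label_is <- function(ns, vals, x) identical(levels(x)[ns], vals)
nth_value_is <- function(ns, vals, x) identical(as.character(x)[ns], vals)

res <- cut(1:10, c(0, 5, 10))
stopifnot(
  has_n_bins(2, res),
  nth_label_is(1, "(0,5]", res),
  nth_value_is(c(1, 10), c("(0,5]", "(5,10]"), res)
)
```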

empty bins

Handle the issue of empty bins, whether they are at the ends or in the middle. The levels should not be removed (users can use droplevels on the result); we need a general approach.

This is a good starting case:

x <- c(rep(1,7),rep(2,5),3:6,17:20)
cuts <- c(-5,0,10, 15, 20, 25)
table(cut3(x, cuts))
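For comparison, the same case with base `cut`, which keeps empty bins as levels and leaves dropping them to the user. This shows base R behavior only, not what cut3 currently returns.

```r
x <- c(rep(1, 7), rep(2, 5), 3:6, 17:20)
cuts <- c(-5, 0, 10, 15, 20, 25)

binned <- cut(x, cuts)
table(binned)               # empty bins at both ends and in the middle are kept
table(droplevels(binned))   # users can drop them explicitly
```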

refactor code

get_cuts needs to include the transformations on i and what that are in part in main function atm.

need functions like get_cuts_from_breaks so the code can breathe more

redesign arguments

optim_fun is specific to what = groups, so it shouldn't be a separate argument; it should be the optional second element of i, which would then be input as a list.

The width_* options should become simply width, with an i argument that could be a list whose second element is a function/formula of .x = x and .y = width (the first element of i). The result of this function would give the first cut point, so compared to the current values the functions would be:

width (default): ~ min(.x) -.y/2 or shortcut "default" (necessary)
width_min : ~ min(.) or shortcut "min"
width_max : ~ max(.) or shortcut "max"
width_0 : ~ 0
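A sketch of how a "first cut point" function could drive the width option, using plain functions of (x, width) instead of formulas. `width_breaks` is a hypothetical helper; the width_max case is left out because its intended direction is not specified here.

```r
# Hypothetical helper: fixed-width breaks starting at start(x, width) and
# extending until max(x) is covered.
width_breaks <- function(x, width, start = function(x, w) min(x) - w / 2) {
  first <- start(x, width)
  n <- ceiling((max(x) - first) / width)
  first + width * 0:n
}

x <- 1:6
width_breaks(x, 2)                              # default: min(x) - width / 2
width_breaks(x, 2, function(x, w) min(x))       # like width_min
width_breaks(x, 2, function(x, w) 0)            # like width_0
```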

implement Frank Harrell's algorithm for groups

for completeness it would be nice, works better than naive quantile for many cases, and probably faster than cut3 optimization

Needs to be understood, isolated, and reversed to work for both values of closed

wrong repo

closed = "both" needs to be changed and documented

it's not well defined when cut points are given explicitly because then we also need the left/right args for when data falls on cutpoints.

crop is used for min and max already, maybe shrink ? crop_all ? squeeze ? I think I like squeeze

explicit what = "quantile" ?

same as quantile(x,c(...)) but slightly less verbose and more readable ?

Would deal with duplicate breaks

see package classInt

https://cran.r-project.org/web/packages/classInt/classInt.pdf

There are some things to take from there and some others not to take.

"fixed" : our default breaks
pretty / sd : pretty is nice but it just creates breaks, so it can be used as i = pretty(x) with the default what; sd is apparently pretty applied to centered/scaled data, so the same applies
"equal" is a better name than "n_interval"; we should use it as an alias and drop n_interval after some time
"quantile" : we have it, but should add examples of different quantile distributions using the functions directly in breaks
"kmeans", "hclust", "bclust", "fisher", or "jenks" : clustering algorithms; some are in base, some will require research, maybe inspect the code from classInt (check license)
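The "equal" and "pretty" styles are easy to reproduce by passing computed breaks to a cut function. The sketch below uses base `cut` so it runs without cutr or classInt.

```r
x <- c(3, 7, 12, 18, 25, 31, 46)
n <- 4

# "equal": n intervals of equal width spanning the range of x
equal_breaks <- seq(min(x), max(x), length.out = n + 1)
equal_binned <- cut(x, equal_breaks, include.lowest = TRUE)
table(equal_binned)

# "pretty": round cut points computed by base::pretty
pretty_binned <- cut(x, pretty(x), include.lowest = TRUE)
table(pretty_binned)
```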

close this task when those issues are assigned as their own issues

feature cutate

cut is slightly annoying to use in pipe chains:

df %>%
  mutate(my_variable1 = cut(my_variable1, ...)) %>%
  mutate(my_variable2 = cut(my_variable2, ...))

df %>%
  mutate_at(vars(my_variable1, my_variable2), cut, ... )

Instead we can have

df %>%
  cutate(vars(my_variable1, my_variable2), ... )

a little more readable
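A base R sketch of what such a function could do, assuming columns are given as a character vector rather than through vars(), and that base `cut` stands in for the package's cut function.

```r
# Hypothetical sketch of cutate: apply a cut function to several columns.
# Assumption: `cols` is a character vector of column names.
cutate <- function(df, cols, ...) {
  df[cols] <- lapply(df[cols], cut, ...)
  df
}

df <- data.frame(a = c(1, 5, 9), b = c(2, 4, 8), id = 1:3)
out <- cutate(df, c("a", "b"), breaks = c(0, 3, 6, 10))
out
```

Being a plain data-frame-in, data-frame-out function, it would also work as a step in a pipe chain.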

Design tests

integrate optimal cuts feature

through optim_fun

See alternatives : https://stackoverflow.com/questions/6104836/splitting-a-continuous-variable-into-equal-sized-groups

Review this function to use .binData instead of cut; the function used should have 2 parameters, bin sizes and cut points.

squeeze bugs with brackets = NULL

brackets = NULL is handy so let's make it work

squeeze is bugged and needs more tests

The first two work, the later ones don't:

x <- c(rep(1,3),rep(2,2),3:6,17:20)

table(smart_cut(x,c(0,10,30),squeeze = TRUE))
# 
#   [1,6] [17,20] 
#       9       4 

table(smart_cut(x,c(0,1.5,30),squeeze = TRUE))
# 
#      1 [2,20] 
#      3     10 

table(smart_cut(x,c(0,1,30),squeeze = TRUE))
# 
# (1,NA) [1,20] 
#      0     13 

table(smart_cut(x,c(0,1.5,10,30),squeeze = TRUE))
# 
#        1 [1.5,10]  [17,20] 
#        3        6        4

smart_cut(1:6,list(2,0),"width",open_end = F)
smart_cut(1:6,list(2,0),"width",open_end = T)

Make sure that all functions support formula notation

output type

We could have a parameter output = "ordered", which could also be "character" or "factor".
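A minimal sketch of how the conversion could look once the bins are computed. `as_output` is a hypothetical helper applied to the result of base `cut`.

```r
# Hypothetical helper: convert a binned factor to the requested output type.
as_output <- function(bins, output = c("ordered", "factor", "character")) {
  output <- match.arg(output)
  switch(output,
    ordered   = factor(bins, levels = levels(bins), ordered = TRUE),
    factor    = factor(bins, levels = levels(bins), ordered = FALSE),
    character = as.character(bins)
  )
}

bins <- cut(c(1, 4, 8), c(0, 3, 6, 10))
as_output(bins)                 # ordered factor, the proposed default
as_output(bins, "character")
```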