
cutr's People

Contributors

moodymudskipper, rgdicker


cutr's Issues

group infrequent values?

Could we have an easy way to group infrequent values?

what == "n_by_group" deals with some of these cases, but not all.

One of these cases is grouping all the small levels of an unordered factor together.

Another case is when we want to group everything after or before a hard threshold; the threshold should be given in absolute or relative terms.

For characters or unordered factors, infrequent items would be grouped.

For ordered factors or numerics, they would be grouped only when they follow each other. In practice this would mainly be used to aggregate the tails.

We could have new values of what :

  • "aggregate_bigger"
  • "aggregate_smaller"
  • "aggregate_rare"

We could consider that an i strictly between 0 and 1 means relative and an i of 1 or more means absolute. It's a bit awkward and can be dangerous, but it has the advantage of not multiplying parameters.

The other options are having more values of what (but we already have a lot) or adding new arguments.
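
A base-R sketch of what the proposed behavior could look like for the rare-level case, including the i < 1 relative / i >= 1 absolute convention (aggregate_rare here is only a hypothetical name taken from the list above, not an existing function):

```r
# Hypothetical sketch: merge infrequent factor levels into one.
# i < 1 is read as a relative frequency threshold, i >= 1 as an absolute count.
aggregate_rare <- function(x, i, other = "other") {
  x <- as.factor(x)
  counts <- table(x)
  threshold <- if (i < 1) i * length(x) else i
  rare <- names(counts)[counts < threshold]
  # assigning the same label to several levels merges them
  levels(x)[levels(x) %in% rare] <- other
  x
}

x <- factor(c("a", "a", "a", "b", "b", "c", "d"))
table(aggregate_rare(x, 2))    # absolute: merge levels seen fewer than 2 times
table(aggregate_rare(x, 0.3))  # relative: merge levels below 30% of observations
```

For ordered/numeric inputs the same idea would apply, but only to runs of adjacent rare values (the tails), as described above.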

be clearer about duplicate cut points

So far we take unique cut points with what = "group" but not for "breaks" or "quantiles"; this needs to be consistent. Also, empty categories are currently shown with [), they should be ().

center_fun to be remapped to 'labels'

When labels is a character vector, do as now; when it's a function or formula, execute it on the bin's content. It could thus be a centering function, but also, for example, ~ paste(., collapse = "+") for factors.
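
A base-R sketch of the intended behavior (plain cut and tapply as stand-ins; the cutr interface would wire this into labels directly):

```r
# Build labels by applying a function to each bin's content.
x <- c(1, 2, 2, 5, 7, 9)
bins <- cut(x, c(0, 4, 10))
label_fun <- function(v) paste(v, collapse = "+")
# tapply applies label_fun to the values falling in each bin, in level order
levels(bins) <- as.character(tapply(x, bins, label_fun))
table(bins)
```

The same mechanism would accept mean or median as a center function instead of the paste formula.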

benchmarks

Compare speed with other methods (base, Hmisc, ggplot2) and between different values of what leading to similar groups.
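
A possible harness for this (a sketch only: it assumes the microbenchmark and Hmisc packages are available, and uses cut3 as elsewhere in these issues):

```r
# Hypothetical benchmark comparing binning backends on the same breaks.
library(microbenchmark)
x <- rnorm(1e5)
breaks <- quantile(x, seq(0, 1, length.out = 5))
microbenchmark(
  base  = cut(x, breaks, include.lowest = TRUE),
  Hmisc = Hmisc::cut2(x, cuts = breaks),
  cutr  = cut3(x, breaks),
  times = 20
)
```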

Check that crop doesn't mess up number of intervals

If my data starts at 1, closed = "right", and the cuts are 0, 1, and so on, what happens to my first interval?

Maybe add an exception for this case. simplify will take care of it; otherwise it's the same issue as the Inf situation in get_optimal_cuts.

cutb and cut2b

In addition to cut3, have cutb and cut2b (or cutf and cut2f, or cut_format and cut_format2), which are just versions of cut and cut2 with a format_fun argument.

more flexible simplify argument

simplify could work:

  • only for values that fall on cut points (like cut2 does, despite what its doc says)
  • on any point (like cut3 currently does)
  • on any point AND crop neighboring intervals (could be the default)

They need names :

  • only_breaks
  • any
  • any_and_crop
  • NULL

Readme / Vignette

Show the features with examples; maybe better to illustrate most of them with charts?

Would be nice if we could use only charts, but still convey that it's not a graphical package.

  • design a nice small bimodal dataset, with a few outliers to show the expand, crop, squeeze features etc...

  • design a bigger continuous dataset with big numbers and many digits to show the format_metric and smart_signif features

  • Show example with factors

  • show how using formatC is like using cut, and using format is like using cut2
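
For the first bullet, one possible shape for the small bimodal dataset (illustrative only, nothing here is committed anywhere):

```r
# A small bimodal dataset with a few outliers, to motivate expand/crop/squeeze.
set.seed(1)
x <- c(rnorm(60, mean = 2, sd = 0.5),   # first mode
       rnorm(40, mean = 8, sd = 0.7),   # second mode
       -15, 25, 30)                     # outliers
hist(x, breaks = 30)
```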

feature smart_bin

The whole beginning of the function should be wrapped into smart_bin; then, with a what = "bins" feature, we'll be able to run both parts of the cut separately.

Design test functions

functions like has_n_bins(n, call) / nth_label_is(ns, vals) / nth_value_is(ns, vals)
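
Minimal sketches of what these could look like (the issue only gives names and argument lists, so the bodies, and the extra call argument on the last two, are assumptions; built on base stopifnot rather than a testing framework):

```r
# Hypothetical test helpers for checking the result of a cut call.
has_n_bins <- function(n, call) {
  stopifnot(nlevels(eval(call)) == n)
  invisible(TRUE)
}
nth_label_is <- function(ns, vals, call) {
  stopifnot(identical(levels(eval(call))[ns], vals))
  invisible(TRUE)
}
nth_value_is <- function(ns, vals, call) {
  stopifnot(identical(as.character(eval(call))[ns], vals))
  invisible(TRUE)
}

cl <- quote(cut(1:10, c(0, 5, 10)))
has_n_bins(2, cl)
nth_label_is(1, "(0,5]", cl)
nth_value_is(10, "(5,10]", cl)
```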

empty bins

Handle the issue of empty bins, whether they are at the ends or in the middle. The levels should not be removed (users can use droplevels on the result); have a general approach.

This is a good starting case:

x <- c(rep(1,7),rep(2,5),3:6,17:20)
cuts <- c(-5,0,10, 15, 20, 25)
table(cut3(x, cuts))

refactor code

get_cuts needs to include the transformations on i and what that are currently partly in the main function.

need functions like get_cuts_from_breaks so the code can breathe more

redesign arguments

optim_fun is specific to what = "groups", so it shouldn't be a separate argument; it should be the optional second element of i, which would then be input as a list.

The width_* options should simply become width, with an i argument that could be a list whose second element is a function/formula of .x = x and .y = width (the first element of i). The result of this function would give the first cut point, so compared to the current values the functions would be:

  • width (default): ~ min(.x) - .y/2, or shortcut "default" (necessary)
  • width_min: ~ min(.), or shortcut "min"
  • width_max: ~ max(.), or shortcut "max"
  • width_0: ~ 0

implement Frank Harrell's algorithm for groups

For completeness it would be nice: it works better than naive quantiles in many cases, and is probably faster than the cut3 optimization.

It needs to be understood, isolated, and reversed so it works for both values of closed.

closed = "both" needs to be changed and documented

It's not well defined when cut points are given explicitly, because then we also need the left/right arguments for when data falls on cut points.

crop is already used for min and max, so maybe shrink? crop_all? squeeze? I think I like squeeze.

see package classInt

https://cran.r-project.org/web/packages/classInt/classInt.pdf

There are some things to take from there and some others not to take.

  • "fixed": our default breaks
  • pretty / sd: pretty is nice but it just creates breaks, so it can be used with i = pretty(x) and the default what; sd is apparently pretty applied to centered/scaled data, so it amounts to the same thing
  • "equal" is a better name than "n_interval"; we should use it as an alias and throw away n_interval after some time
  • "quantile": we have it, but we should add examples of different quantile distributions using the functions directly in breaks
  • "kmeans", "hclust", "bclust", "fisher", or "jenks": clustering algorithms; some are in base, some will require some research, maybe inspect code from classInt (check licence)
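
classInt's documented entry point is classIntervals(), and its styles return plain numeric breaks, so they could feed a breaks-style what directly (sketch assumes the classInt package is installed):

```r
# classInt styles produce a plain numeric vector of cut points.
library(classInt)
set.seed(42)
x <- c(rnorm(50), rnorm(50, mean = 5))
ci <- classIntervals(x, n = 4, style = "kmeans")
ci$brks                                   # usable as explicit breaks
table(cut(x, ci$brks, include.lowest = TRUE))
```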

close this task when those issues are assigned as their own issues

feature cutate

cut is slightly annoying to use in pipe chains:

df %>%
  mutate(my_variable1 = cut(my_variable1, ...)) %>% 
  mutate(my_variable2 = cut(my_variable2, ...))

Or

df %>%
  mutate_at(vars(my_variable1, my_variable2), cut, ... )

Instead we can have

df %>%
  cutate(vars(my_variable1, my_variable2), ... )

which is a little more readable.

squeeze is bugged and needs more tests

The first two calls below work, the last two don't:

x <- c(rep(1,3),rep(2,2),3:6,17:20)

table(smart_cut(x,c(0,10,30),squeeze = TRUE))
# 
#   [1,6] [17,20] 
#       9       4 

table(smart_cut(x,c(0,1.5,30),squeeze = TRUE))
# 
#      1 [2,20] 
#      3     10 

table(smart_cut(x,c(0,1,30),squeeze = TRUE))
# 
# (1,NA) [1,20] 
#      0     13 

table(smart_cut(x,c(0,1.5,10,30),squeeze = TRUE))
# 
#        1 [1.5,10]  [17,20] 
#        3        6        4 

Support ordered factors

The squeeze feature will be nice for those, so let's code it first.

Unused levels might not be an issue, but pay attention.

rework SO post

It's not clear and doesn't handle edge cases; also the intro is unnecessary.

Simplify get_optimal_cutpoints by removing purrr stuff and getNamespace, show examples, then show how it would work with the package.

not sure if set_mappers was so smart

The idea is good and might be handy for other packages, but it doesn't play well with the interface, as we map conditionally and over few functions. Maybe just get rid of it and do simple tests + as_function. It will be easier to read.

feature : membership

If what = "membership", i must be a vector of the same length as x, with one i value per element of x, and i must be sorted along x (x itself must be sorted).

This will make it possible to use any clustering/grouping method unsupported by cutr while still leveraging the labelling features.
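
A base-R sketch of the idea: the grouping is computed outside (kmeans here, but any method would do) and passed as the membership vector, so the package would only handle the labelling (split is a stand-in for that step):

```r
# Membership computed externally, then used to form bins.
set.seed(1)
x <- sort(c(rnorm(20, 0), rnorm(20, 6)))
cl <- kmeans(x, centers = 2)$cluster
# renumber clusters in order of first appearance so the membership
# vector is sorted along the (sorted) x, as required above
membership <- match(cl, unique(cl))
split(x, membership)   # the bins a membership-based cut would label
```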

open_end doc should be more clear

It says "keep open side open", which is not explicit enough. What it does is include (or not) the last point on the open side IF it falls on a cut point; in particular, with the default expand = TRUE, the extremities will always be closed:

smart_cut(1:6,list(2,0),"width",open_end = F)
smart_cut(1:6,list(2,0),"width",open_end = T)

output type

We could have a parameter output = "ordered", which could also be "character" or "factor".
