ludvigolsen / groupdata2


R-package: Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling or collapse them to fewer groups.

License: Other

R 99.98% CSS 0.02%
Topics: group-factor, cross-validation, fold, staircase, balance, groups, participants, data-frame, split, data

groupdata2's Issues

Package failing with devel version of checkmate

I'm preparing a new release of checkmate, and some minor changes in checkmate are causing your unit tests to break. AFAICT, the tests fail because some error messages have changed slightly:

  ── 1. Failure: fuzz testing input checks for differs_from_previous() (@test_diff
  `xpectr::strip_msg(...)` threw an error with unexpected message.
  Expected match: "Assertion failed One of the following must apply checkmatecheckstringhandlena Must be of type string not factor checkmatechecknumberhandlena Must be of type number not factor"
  Actual message: "Assertion on handlena failed One of the following must apply checkmatecheckstringhandlena Must be of type string not factor checkmatechecknumberhandlena Must be of type number not factor"
  Backtrace:
   1. testthat::expect_error(...)
   6. xpectr::strip_msg(...)
   7. xpectr::stop_if(...)
  
  ── 2. Failure: fuzz testing input checks for differs_from_previous() (@test_diff
  `xpectr::strip_msg(...)` threw an error with unexpected message.
  Expected match: "Assertion failed One of the following must apply checkmatecheckstringhandlena Must be of type string not NULL checkmatechecknumberhandlena Must be of type number not NULL"
  Actual message: "Assertion on handlena failed One of the following must apply checkmatecheckstringhandlena Must be of type string not NULL checkmatechecknumberhandlena Must be of type number not NULL"
  Backtrace:
   1. testthat::expect_error(...)
   6. xpectr::strip_msg(...)
   7. xpectr::stop_if(...)
  
  ── 3. Failure: find_missing_starts() find the right missing starts (@test_find_m
  `xpectr::strip_msg(...)` threw an error with unexpected message.
  Expected match: "Assertion failed One of the following must apply checkmatechecknumericn Must be of type numeric not dataframe checkmatecheckcharactern Must be of type character not dataframe checkmatechecklistn Must be of type list not dataframe"
  Actual message: "Assertion on n failed One of the following must apply checkmatechecknumericn Must be of type numeric not dataframe checkmatecheckcharactern Must be of type character not dataframe checkmatechecklistn Must be of type list not dataframe"
  Backtrace:
   1. testthat::expect_error(...)
   6. xpectr::strip_msg(...)
   7. xpectr::stop_if(...)
  
  ── 4. Failure: fuzz testing input checks for l_starts method in group_factor() (
  `xpectr::strip_msg(...)` threw an error with unexpected message.
  Expected match: "Assertion failed One of the following must apply checkmatecheckstringstartscol May not be NA checkmatecheckcountstartscol May not be NA"
  Actual message: "Assertion on startscol failed One of the following must apply checkmatecheckstringstartscol May not be NA checkmatecheckcountstartscol May not be NA"
  Backtrace:
   1. testthat::expect_error(...)
   6. xpectr::strip_msg(...)
   7. xpectr::stop_if(...)
  
  ══ testthat results  ═══════════════════════════════════════════════════════════
  [ OK: 1378 | SKIPPED: 4 | WARNINGS: 0 | FAILED: 4 ]
  1. Failure: fuzz testing input checks for differs_from_previous() (@test_differs_from_previous.R#640) 
  2. Failure: fuzz testing input checks for differs_from_previous() (@test_differs_from_previous.R#652) 
  3. Failure: find_missing_starts() find the right missing starts (@test_find_missing_starts.R#217) 
  4. Failure: fuzz testing input checks for l_starts method in group_factor() (@test_group_factor.R#1974) 
  
  Error: testthat unit tests failed
  Execution halted

Could you relax the tests so that they pass again with the current devel version of checkmate?

Fix implementation of multiple unique fold columns for repeated cross-validation

TODO for the new functionality (quick implementation) in fold(), intended for repeated cross-validation (repeatedCV branch):

  1. When detecting identical fold columns, comparisons already made in earlier iterations are repeated in later ones. This is unnecessary. Also, create tests to see how the comparison step scales with bigger datasets.

  2. Each iteration of creating new fold columns creates num_fold_cols columns at once, which was a somewhat lazy implementation. Adding only one or a few columns at a time could save time.

Seems like there's room for improvement.
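As a reference point for item 1, the duplicate detection can be organized so that each candidate column is compared only once against the already-accepted columns. This is just a base-R sketch of that idea, not groupdata2's actual internals; `same_grouping` and `add_if_unique` are made-up names.

```r
# Two fold columns induce the same folds if one is a relabelling of the
# other, so compare them after normalizing labels by order of appearance.
same_grouping <- function(a, b) {
  identical(
    as.integer(factor(a, levels = unique(a))),
    as.integer(factor(b, levels = unique(b)))
  )
}

# Keep accepted fold columns in a list; each new candidate is checked once
# against that list and either appended or discarded. No accepted pair is
# ever re-compared in later iterations.
add_if_unique <- function(accepted, candidate) {
  is_dup <- any(vapply(accepted, same_grouping, logical(1), b = candidate))
  if (!is_dup) accepted[[length(accepted) + 1L]] <- candidate
  accepted
}
```

Generating one candidate at a time and filtering it through `add_if_unique()` would also address item 2, since no batch of num_fold_cols columns has to be created up front.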

Feature request: add downsampling and upsampling

I love this package. It is very helpful for preparing data for mixed-model analyses and for cross-validation. It would be great if we could also balance nested data by downsampling or upsampling before running the model. Anyway, thank you for the great work you have done.
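For concreteness, down- and upsampling to a common group size could be sketched in base R like this (a hypothetical sketch, not groupdata2 code; `balance_cats` is a made-up name, and edge cases such as empty levels are ignored):

```r
# Resample every level of `cat_col` to a common size: the smallest level's
# count for downsampling ("min") or the largest level's for upsampling ("max").
balance_cats <- function(data, cat_col, size = c("min", "max")) {
  size <- match.arg(size)
  counts <- table(data[[cat_col]])
  target <- if (size == "min") min(counts) else max(counts)
  rows_by_level <- split(seq_len(nrow(data)), data[[cat_col]])
  idx <- unlist(lapply(rows_by_level, function(rows) {
    # downsampling draws without replacement, upsampling with replacement
    rows[sample.int(length(rows), target, replace = target > length(rows))]
  }))
  data[idx, , drop = FALSE]
}
```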

In partition/fold: restriction on having different categories within an ID

groupdata2 looks great and solves a lot of the problems I have been looking to solve!

I tried to use it for one of the problems I am working on and I am running into the following error, which I know is by design:

 The value in 'data[[cat_col]]' must be
 * constant within each ID.

Is it possible to relax this restriction? It is very possible that two measurements from the same group (or ID) have different categories. For example, a subject could get two different diagnoses, and we want to make sure they are still only in training or only in test, while both diagnoses count toward the totals. In the case I'm looking at, the IDs are studies, and examples of both classes can occur within a single study.

E.g., using your df as defined in the cross-validation with group data vignette, let's just change:

df[3, "diagnosis"] <- "b"
parts <- partition(df, p = 0.2, id_col = "participant", cat_col = 'diagnosis')

And then when we run partition() it fails with the above error.

Thanks so much!
Emily
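Until the restriction is relaxed, one possible workaround is to derive an ID-level category, for instance each ID's most frequent diagnosis, and pass that as cat_col: the value is then constant within each ID while every row still follows its ID into training or test. A hypothetical base-R sketch (not groupdata2 functionality; `majority_cat_per_id` and the `id_cat` column are made up):

```r
# Give every row of an ID that ID's most frequent category; ties are broken
# by whichever category which.max() sees first.
majority_cat_per_id <- function(data, id_col, cat_col) {
  per_id <- tapply(data[[cat_col]], data[[id_col]],
                   function(x) names(which.max(table(x))))
  data$id_cat <- unname(per_id[as.character(data[[id_col]])])
  data
}
```

The trade-off is that balancing then happens on the collapsed label, so minority-category rows within an ID are not counted separately toward the category totals.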

partition usage

Thank you for sharing your work.
I have tried to use partition() on a 24-row data frame with a 25%/75% split:
parts <- partition(mydfb, p = c(0.25), id_col = "mysubject_index", cat_col = 'mygenotype')
mydfb
mysubject_index mygenotype
1 1 1
2 2 1
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 1
9 9 1
10 10 1
11 11 0
12 12 1
13 13 1
14 14 1
15 15 1
16 16 1
17 17 0
18 18 0
19 19 1
20 20 0
21 21 0
22 22 0
23 23 0
24 24 0
and I am surprised to see that it produces a 5-row test_set and a 19-row train_set:

test_set <- parts[[1]]
train_set <- parts[[2]]
test_set %>% kable()
train_set %>% kable()

mysubject_index mygenotype
4 0
6 0
18 0
12 1
15 1
mysubject_index mygenotype
---------------: ----------:
3 0
5 0
7 0
11 0
17 0
20 0
21 0
22 0
23 0
24 0
1 1
2 1
8 1
9 1
10 1
13 1
14 1
16 1
19 1

It is possible that I am using it wrong, or maybe there is an off-by-one somewhere?
Many thanks,
Alex
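For what it's worth, the 5-row test set matches what you would get if p is applied within each genotype level and the group size is rounded down (an assumption about partition()'s rounding behaviour, not taken from its source). The data has 13 rows with genotype 0 and 11 with genotype 1:

```r
# 25% of each genotype level, rounded down:
floor(0.25 * 13) + floor(0.25 * 11)  # 3 + 2 = 5 rows in the test set
```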

1 test fails on PowerPC: group_counts(c(1:200), c(0.55)) not equal to c(110, 90)

R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: powerpc-apple-darwin10.8.0 (32-bit)

> library(testthat)
> library(groupdata2)
> 
> if (require("xpectr")) {
+   test_check("groupdata2")
+ }
Loading required package: xpectr
[ FAIL 1 | WARN 11 | SKIP 5 | PASS 3525 ]

══ Skipped tests ═══════════════════════════════════════════════════════════════
• Next part is too slow (1)
• On CRAN (1)
• Simulation that runs for a long time (1)
• Skipping bootstrapped numerical balancing test (1)
• Skipping bootstrapped numerical balancing test in partition() (1)

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test_group_factor.R:362:3'): group sizes works with group_factor with method l_sizes ──
group_counts(c(1:200), c(0.55)) not equal to c(110, 90).
2/2 mismatches (average diff: 1)
[1] 109 - 110 == -1
[2]  91 -  90 ==  1

[ FAIL 1 | WARN 11 | SKIP 5 | PASS 3525 ]
Error: Test failures
Execution halted
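A plausible cause (an assumption, not verified against groupdata2's source) is floating-point rounding in the group-size computation: if the first group size comes from something like floor(p * n), the result is sensitive to how 0.55 * 200 rounds, and that can differ across platforms such as 32-bit PowerPC.

```r
# On IEEE-754 doubles (x86_64), 0.55 is stored slightly above 0.55, so the
# product lands just above 110 and floor() keeps 110:
0.55 * 200 == 110   # FALSE: the product is not exactly 110
floor(0.55 * 200)   # 110 here; a platform rounding just below 110 gives 109
```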
