ludvigolsen / groupdata2


R-package: Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling or collapse them to fewer groups.

License: Other

R 99.98% CSS 0.02%
Topics: group-factor, cross-validation, fold, staircase, balance, groups, participants, data-frame, split, data

groupdata2's Issues

Package failing with devel version of checkmate

I'm preparing a new release of checkmate, and some minor changes in checkmate are causing your unit tests to break. AFAICT, the tests fail because some error messages have changed slightly:

  ── 1. Failure: fuzz testing input checks for differs_from_previous() (@test_diff
  `xpectr::strip_msg(...)` threw an error with unexpected message.
  Expected match: "Assertion failed One of the following must apply checkmatecheckstringhandlena Must be of type string not factor checkmatechecknumberhandlena Must be of type number not factor"
  Actual message: "Assertion on handlena failed One of the following must apply checkmatecheckstringhandlena Must be of type string not factor checkmatechecknumberhandlena Must be of type number not factor"
  Backtrace:
   1. testthat::expect_error(...)
   6. xpectr::strip_msg(...)
   7. xpectr::stop_if(...)
  
  ── 2. Failure: fuzz testing input checks for differs_from_previous() (@test_diff
  `xpectr::strip_msg(...)` threw an error with unexpected message.
  Expected match: "Assertion failed One of the following must apply checkmatecheckstringhandlena Must be of type string not NULL checkmatechecknumberhandlena Must be of type number not NULL"
  Actual message: "Assertion on handlena failed One of the following must apply checkmatecheckstringhandlena Must be of type string not NULL checkmatechecknumberhandlena Must be of type number not NULL"
  Backtrace:
   1. testthat::expect_error(...)
   6. xpectr::strip_msg(...)
   7. xpectr::stop_if(...)
  
  ── 3. Failure: find_missing_starts() find the right missing starts (@test_find_m
  `xpectr::strip_msg(...)` threw an error with unexpected message.
  Expected match: "Assertion failed One of the following must apply checkmatechecknumericn Must be of type numeric not dataframe checkmatecheckcharactern Must be of type character not dataframe checkmatechecklistn Must be of type list not dataframe"
  Actual message: "Assertion on n failed One of the following must apply checkmatechecknumericn Must be of type numeric not dataframe checkmatecheckcharactern Must be of type character not dataframe checkmatechecklistn Must be of type list not dataframe"
  Backtrace:
   1. testthat::expect_error(...)
   6. xpectr::strip_msg(...)
   7. xpectr::stop_if(...)
  
  ── 4. Failure: fuzz testing input checks for l_starts method in group_factor() (
  `xpectr::strip_msg(...)` threw an error with unexpected message.
  Expected match: "Assertion failed One of the following must apply checkmatecheckstringstartscol May not be NA checkmatecheckcountstartscol May not be NA"
  Actual message: "Assertion on startscol failed One of the following must apply checkmatecheckstringstartscol May not be NA checkmatecheckcountstartscol May not be NA"
  Backtrace:
   1. testthat::expect_error(...)
   6. xpectr::strip_msg(...)
   7. xpectr::stop_if(...)
  
  ══ testthat results  ═══════════════════════════════════════════════════════════
  [ OK: 1378 | SKIPPED: 4 | WARNINGS: 0 | FAILED: 4 ]
  1. Failure: fuzz testing input checks for differs_from_previous() (@test_differs_from_previous.R#640) 
  2. Failure: fuzz testing input checks for differs_from_previous() (@test_differs_from_previous.R#652) 
  3. Failure: find_missing_starts() find the right missing starts (@test_find_missing_starts.R#217) 
  4. Failure: fuzz testing input checks for l_starts method in group_factor() (@test_group_factor.R#1974) 
  
  Error: testthat unit tests failed
  Execution halted

Could you relax the tests so that they pass again with the current devel version of checkmate?

Fix implementation of multiple unique fold columns for repeated cross-validation

TODO for the new functionality (quick implementation) in fold(), intended for repeated cross-validation (repeatedCV branch):

  1. When detecting identical fold columns, comparisons already made in earlier iterations are repeated in later ones. This is unnecessary. Also, create tests to see how the comparison step scales with bigger datasets.

  2. Each iteration of creating new fold columns creates num_fold_cols columns at once, which was a somewhat lazy implementation. Adding only one or a few columns at a time could save time.

Seems like there's room for improvement.
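As a reference point for item 1, the duplicate detection can be organized so that each candidate column is compared only once against the already-accepted columns. This is just a base-R sketch of that idea, not groupdata2's actual internals; `same_grouping` and `add_if_unique` are made-up names.

```r
# Two fold columns induce the same folds if one is a relabelling of the
# other, so compare them after normalizing labels by order of appearance.
same_grouping <- function(a, b) {
  identical(
    as.integer(factor(a, levels = unique(a))),
    as.integer(factor(b, levels = unique(b)))
  )
}

# Keep accepted fold columns in a list; each new candidate is checked once
# against that list and either appended or discarded. No accepted pair is
# ever re-compared in later iterations.
add_if_unique <- function(accepted, candidate) {
  is_dup <- any(vapply(accepted, same_grouping, logical(1), b = candidate))
  if (!is_dup) accepted[[length(accepted) + 1L]] <- candidate
  accepted
}
```

Generating one candidate at a time and filtering it through `add_if_unique()` would also address item 2, since no batch of num_fold_cols columns has to be created up front.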

Feature request: add downsampling and upsampling

I love this package. It is very helpful for preparing data for mixed-model analyses and for cross-validation. It would be great if we could also balance nested data by downsampling or upsampling before running the model. Anyway, thank you for the great work you have done.
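For concreteness, down- and upsampling to a common group size could be sketched in base R like this (a hypothetical sketch, not groupdata2 code; `balance_cats` is a made-up name, and edge cases such as empty levels are ignored):

```r
# Resample every level of `cat_col` to a common size: the smallest level's
# count for downsampling ("min") or the largest level's for upsampling ("max").
balance_cats <- function(data, cat_col, size = c("min", "max")) {
  size <- match.arg(size)
  counts <- table(data[[cat_col]])
  target <- if (size == "min") min(counts) else max(counts)
  rows_by_level <- split(seq_len(nrow(data)), data[[cat_col]])
  idx <- unlist(lapply(rows_by_level, function(rows) {
    # downsampling draws without replacement, upsampling with replacement
    rows[sample.int(length(rows), target, replace = target > length(rows))]
  }))
  data[idx, , drop = FALSE]
}
```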

In partition/fold: restriction on having different categories within an ID

groupdata2 looks great and solves a lot of the problems I have been looking to solve!

I tried to use it for one of the problems I am working on and I am running into the following error, which I know is by design:

 The value in 'data[[cat_col]]' must be
 * constant within each ID.

Is it possible to relax this restriction? It is very possible that two measurements from the same group (or ID) have different categories. For example, a subject could get two different diagnoses, and we want to make sure they are still only in training or only in test, while both diagnoses count toward the totals. In the case I'm looking at, the IDs are studies, and examples of both classes can occur within a single study.

E.g., using your df as defined in the cross-validation with group data vignette, let's just change:

df[3, "diagnosis"] <- "b"
parts <- partition(df, p = 0.2, id_col = "participant", cat_col = 'diagnosis')

And then when we run partition() it fails with the above error.

Thanks so much!
Emily
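Until the restriction is relaxed, one possible workaround is to derive an ID-level category, for instance each ID's most frequent diagnosis, and pass that as cat_col: the value is then constant within each ID while every row still follows its ID into training or test. A hypothetical base-R sketch (not groupdata2 functionality; `majority_cat_per_id` and the `id_cat` column are made up):

```r
# Give every row of an ID that ID's most frequent category; ties are broken
# by whichever category which.max() sees first.
majority_cat_per_id <- function(data, id_col, cat_col) {
  per_id <- tapply(data[[cat_col]], data[[id_col]],
                   function(x) names(which.max(table(x))))
  data$id_cat <- unname(per_id[as.character(data[[id_col]])])
  data
}
```

The trade-off is that balancing then happens on the collapsed label, so minority-category rows within an ID are not counted separately toward the category totals.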

partition usage

Thank you for sharing your work.
I have tried to use partition() on a 24-row data frame with a 25%/75% split:
parts <- partition(mydfb, p = c(0.25), id_col = "mysubject_index", cat_col = 'mygenotype')
mydfb
mysubject_index mygenotype
1 1 1
2 2 1
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 1
9 9 1
10 10 1
11 11 0
12 12 1
13 13 1
14 14 1
15 15 1
16 16 1
17 17 0
18 18 0
19 19 1
20 20 0
21 21 0
22 22 0
23 23 0
24 24 0
and I am surprised to see that it produces a 5-row test_set and a 19-row train_set:

test_set <- parts[[1]]
train_set <- parts[[2]]
test_set %>% kable()
train_set %>% kable()

mysubject_index mygenotype
4 0
6 0
18 0
12 1
15 1
mysubject_index mygenotype
---------------: ----------:
3 0
5 0
7 0
11 0
17 0
20 0
21 0
22 0
23 0
24 0
1 1
2 1
8 1
9 1
10 1
13 1
14 1
16 1
19 1

It is possible that I am using it wrong, or maybe there is an off-by-one somewhere?
Many thanks,
Alex
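For what it's worth, the 5-row test set matches what you would get if p is applied within each genotype level and the group size is rounded down (an assumption about partition()'s rounding behaviour, not taken from its source). The data has 13 rows with genotype 0 and 11 with genotype 1:

```r
# 25% of each genotype level, rounded down:
floor(0.25 * 13) + floor(0.25 * 11)  # 3 + 2 = 5 rows in the test set
```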

1 test fails on PowerPC: group_counts(c(1:200), c(0.55)) not equal to c(110, 90)

R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: powerpc-apple-darwin10.8.0 (32-bit)

> library(testthat)
> library(groupdata2)
> 
> if (require("xpectr")) {
+   test_check("groupdata2")
+ }
Loading required package: xpectr
[ FAIL 1 | WARN 11 | SKIP 5 | PASS 3525 ]

══ Skipped tests ═══════════════════════════════════════════════════════════════
• Next part is too slow (1)
• On CRAN (1)
• Simulation that runs for a long time (1)
• Skipping bootstrapped numerical balancing test (1)
• Skipping bootstrapped numerical balancing test in partition() (1)

══ Failed tests ════════════════════════════════════════════════════════════════
── Failure ('test_group_factor.R:362:3'): group sizes works with group_factor with method l_sizes ──
group_counts(c(1:200), c(0.55)) not equal to c(110, 90).
2/2 mismatches (average diff: 1)
[1] 109 - 110 == -1
[2]  91 -  90 ==  1

[ FAIL 1 | WARN 11 | SKIP 5 | PASS 3525 ]
Error: Test failures
Execution halted
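A plausible cause (an assumption, not verified against groupdata2's source) is floating-point rounding in the group-size computation: if the first group size comes from something like floor(p * n), the result is sensitive to how 0.55 * 200 rounds, and that can differ across platforms such as 32-bit PowerPC.

```r
# On IEEE-754 doubles (x86_64), 0.55 is stored slightly above 0.55, so the
# product lands just above 110 and floor() keeps 110:
0.55 * 200 == 110   # FALSE: the product is not exactly 110
floor(0.55 * 200)   # 110 here; a platform rounding just below 110 gives 109
```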
