groupdata2 looks great and solves a lot of problems I

in partition/fold: restriction on having different categories within an ID (groupdata2, closed)

ludvigolsen commented on June 27, 2024

in partition/fold: restriction on having different categories within an ID

from groupdata2.

Comments (6)

LudvigOlsen commented on June 27, 2024

Hi Emily,

You are right that there are times when the same ID could reasonably have two different classes, and it's definitely something I should look into supporting in the future. Right now, though, the implementation doesn't make it possible:

With categorical balancing: 1) Split the dataset by the cat_col column. 2) Partition each split. 3) Combine the partitions (first partition from split 1 with first partition from split 2, etc.)

With id_col: 1) Extract the unique IDs. 2) Partition them. 3) Put all rows for the ID in its partition.

With both: 1) Extract the unique IDs with their classes. 2) Do the categorical balancing. 3) Put all rows for the ID in its partition.

So, in the "Both" version, if we had multiple classes within an ID, the approach wouldn't work: I would either have multiple rows per ID (which may end up in different partitions) or only consider one of the classes for each ID (which is what would happen in the current code, I believe).
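The problem can be seen in a minimal base-R sketch. The toy data frame and the column names `participant` and `diagnosis` are assumptions for illustration, and this is not the package's internal code: extracting the unique ID/class pairs yields more than one row for any ID with mixed classes.

```r
# Toy data (hypothetical): participant p1 has two different classes
df <- data.frame(
  participant = c("p1", "p1", "p2", "p2", "p3", "p3"),
  diagnosis   = c("a",  "b",  "a",  "a",  "b",  "b"),
  stringsAsFactors = FALSE
)

# Step 1 of the "both" approach: unique IDs together with their classes
id_class <- unique(df[, c("participant", "diagnosis")])

# An ID that appears more than once here has several classes, so its rows
# could land in different partitions once each class is partitioned separately
rows_per_id <- table(id_class$participant)
mixed_ids <- names(rows_per_id)[rows_per_id > 1]
print(mixed_ids)
```

A check like this can also be used up front to find out whether a dataset is affected at all.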

It's very possible that I can think of a way to solve this when it's not 3 am :)

For now, I'll suggest that you only use the id_col, and perhaps run the partitioning a couple of times with different seeds and use the one that has the best distribution of the classes. Whether this is useful of course depends on the dataset.
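A rough sketch of the "try several seeds" idea, in base R so it runs without any packages. The helper names `partition_ids()` and `class_imbalance()`, the toy data, and the imbalance score are all assumptions made for illustration; with groupdata2 the ID-level split would come from `partition()` with `id_col` instead of the stand-in helper.

```r
# Hypothetical stand-in for ID-level partitioning (not groupdata2's code)
partition_ids <- function(df, id_col, p = 0.2) {
  ids <- unique(df[[id_col]])                          # 1) extract the unique IDs
  n_test <- max(1, round(p * length(ids)))
  test_ids <- sample(ids, n_test)                      # 2) partition the IDs
  ifelse(df[[id_col]] %in% test_ids, "test", "train")  # 3) all rows follow their ID
}

# Ad hoc imbalance score (an assumption): the largest absolute difference
# in class proportions between the train and test partitions
class_imbalance <- function(classes, part) {
  tab <- table(factor(part, levels = c("train", "test")), classes)
  props <- prop.table(tab, margin = 1)
  max(abs(props["train", ] - props["test", ]))
}

# Toy data: 20 participants with 3 rows each and a random class per row
set.seed(1)
df <- data.frame(
  participant = rep(paste0("p", 1:20), each = 3),
  diagnosis   = sample(c("a", "b"), 60, replace = TRUE),
  stringsAsFactors = FALSE
)

# Run the partitioning with different seeds and score each result
seeds <- 1:25
scores <- vapply(seeds, function(s) {
  set.seed(s)
  class_imbalance(df$diagnosis, partition_ids(df, "participant"))
}, numeric(1))

# Keep the seed with the most similar class distributions
best_seed <- seeds[which.min(scores)]
set.seed(best_seed)
df$partition <- partition_ids(df, "participant")
```

Selecting on class balance alone is a heuristic; if the classes are very unevenly spread over IDs, no seed may give a good split.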

Alternatively, if it's only two classes, you could make an extra class if the study has both. So you could make a column new_class with the classes 0,1, and Both. Then use cat_col = new_class. That should at least make sure that every class is included in each partition, although it won't necessarily be well-balanced.
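One way the new_class column could be built, as a base-R sketch. The toy data and the column names `participant` and `diagnosis` are assumptions; only the name `new_class` and the label `Both` come from the suggestion above.

```r
# Toy data (hypothetical): study s1 contains both classes
df <- data.frame(
  participant = c("s1", "s1", "s2", "s2", "s3", "s3", "s4"),
  diagnosis   = c("0",  "1",  "0",  "0",  "1",  "1",  "0"),
  stringsAsFactors = FALSE
)

# One label per ID: "Both" if the ID has more than one class,
# otherwise the single class it contains
per_id <- vapply(
  split(df$diagnosis, df$participant),
  function(x) if (length(unique(x)) > 1) "Both" else x[1],
  character(1)
)

# Map the per-ID label back onto every row of that ID
df$new_class <- unname(per_id[as.character(df$participant)])
print(df)
```

Every ID now has exactly one value of new_class, so the column can be passed as cat_col together with id_col.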

Let me know if those suggestions would work for your project and if you need help implementing them. :)

Best,
Ludvig

erflynn commented on June 27, 2024

Thank you! That makes a lot of sense re the implementation in the package.

I have this implemented using tidyverse and it works (included below). It's long and not exactly the same, but it works well enough for me, though there is probably a cleaner and faster way :).

library(tidyverse)

sep_studies <- function(num_studies, nfolds){
  # separate studies into a certain number of folds by cycling through
  # the fold labels forwards and then backwards (e.g. 1, 2, 2, 1 for nfolds = 2)
  my_l <- c(1:nfolds, nfolds:1)
  if (runif(1, 0, 1) >= 0.5){ # note that because of this, it can be randomly off by a few samples, and this varies
    my_l <- c(nfolds:1, 1:nfolds)
  }
  if (num_studies < (2*nfolds)){
    # fewer studies than one full cycle: take the start of the pattern
    return(my_l[1:num_studies])
  } else {
    # repeat the full cycle, then append the start of the pattern for the remainder
    num_reps <- num_studies %/% (2*nfolds)
    num_rem <- num_studies %% (2*nfolds)
    if (num_rem==0){
      return(rep(my_l, num_reps))
    } else {
      return(c(rep(my_l, num_reps), my_l[1:num_rem]))
    }
  }
}

partition_group_data <- function(df, grp_col="grp", class_col="class", nfolds=2){
  # rename the columns for analysis
  colnames(df)[colnames(df)==grp_col] <- "grp"
  colnames(df)[colnames(df)==class_col] <- "class"

  # get the counts by class in each grp
  study_counts_by_class <- df %>% 
    mutate(grp=as.factor(grp)) %>%
    group_by(grp, class) %>% 
    count() %>% 
    ungroup() %>%
    pivot_wider(names_from=class, values_from=n, names_prefix="num", values_fill=c(n=0)) 
  
  # shuffle and partition
  partitioned_data <- study_counts_by_class %>% 
    group_by_if(is.numeric) %>% 
    sample_n(n()) %>%
    mutate(partition=unlist(sep_studies(n(), nfolds)))  %>%
    ungroup()
  
  # add the sample names back in
  samples_to_grps <- partitioned_data %>% 
    select(grp, partition) %>%
    left_join(df, by=c("grp")) 
  
  return(samples_to_grps)
}


set.seed(104) 
df[3, "diagnosis"] <- "b" # using the same df as before, with the same edit
parts <- partition_group_data(df, grp_col ="participant", class_col="diagnosis", nfolds=2) 
parts %>% group_by(partition, class) %>% count() # varies depending on the iteration, but pretty close

LudvigOlsen commented on June 27, 2024

Just went through your code, and in practice it seems to be in the ballpark of the new_class approach I mentioned (more generalized though). That seems to be a good approach for your situation!
It's unclear to me whether it's the optimal approach in general, so I will need to work with it a bit, but it's definitely a great starting point for thinking about it! Thanks for sharing :)

The code seems to be for fold(). Do you need a version of this for partition as well?

erflynn commented on June 27, 2024

ok! no problem! :) This is something I wish was written when I started the project, so I am sure it will be helpful to others.

I don't. I just partition into 5 folds and then assign folds 1-4 to training for now to get an 80/20 split, but it would be a trivial extension: just change sep_studies() to work with a fraction instead of just shuffling IDs.
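The fold-to-split step described here might look like the following base-R sketch. The data frame `fold_by_study` and its contents are made up for illustration; only the idea of treating folds 1-4 as training and fold 5 as test comes from the comment above.

```r
# Hypothetical fold assignment: 10 studies spread evenly over 5 folds
set.seed(42)
fold_by_study <- data.frame(
  grp       = paste0("study", 1:10),
  partition = sample(rep(1:5, length.out = 10)),
  stringsAsFactors = FALSE
)

# Folds 1-4 become the training set, fold 5 the test set
fold_by_study$split <- ifelse(fold_by_study$partition %in% 1:4, "train", "test")
print(table(fold_by_study$split))
```

The 80/20 ratio holds at the study level; the ratio at the row level depends on how many samples each study contributes.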

LudvigOlsen commented on June 27, 2024

Great :)

If you get any other ideas or find other use cases, do let me know :)

erflynn commented on June 27, 2024

will do!
