Comments (6)

LudvigOlsen commented on June 27, 2024

Hi Emily,

You are right that there are times when the same ID could reasonably have two different classes, and it's definitely something I should look into supporting in the future. Right now, though, the implementation doesn't make it possible:

With categorical balancing: 1) Split the dataset by the cat_col column. 2) Partition each split. 3) Combine the partitions (first partition from split 1 with first partition from split 2, etc.)

With id_col: 1) Extract the unique IDs. 2) Partition them. 3) Put all rows for the ID in its partition.

With both: 1) Extract the unique IDs with their classes. 2) Do the categorical balancing. 3) Put all rows for the ID in its partition.

So, in the "Both" version, if we had multiple classes within an ID, the approach wouldn't work: I would either have multiple rows per ID (which may end up in different partitions) or only consider one of the classes for each ID (which is what would happen in the current code, I believe).
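To make the problem concrete, here is a minimal base-R sketch. It is not groupdata2's actual code, and the data frame `toy` and its column names are made up for illustration. It performs only the first step of the "Both" version as described above (extracting the unique IDs with their classes) and shows that an ID with two classes then appears on two rows.

```r
# Hypothetical toy data: participant 2 has rows with two different classes.
toy <- data.frame(
  participant = c(1, 1, 2, 2, 3, 3),
  diagnosis   = c("a", "a", "a", "b", "b", "b")
)

# Step 1 of the "Both" version as described: unique IDs with their classes.
id_classes <- unique(toy[, c("participant", "diagnosis")])

# An ID that appears more than once here has multiple classes, so splitting
# id_classes by class could send the same ID to different partitions.
ids_with_multiple_classes <- unique(
  id_classes$participant[duplicated(id_classes$participant)]
)
print(ids_with_multiple_classes)
```

A check like this could be run before partitioning to see how many IDs are affected in a given dataset.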

It's very possible that I can think of a way to solve this when it's not 3 am :)

For now, I'll suggest that you only use the id_col, and perhaps run the partitioning a couple of times with different seeds and use the one that has the best distribution of the classes. Whether this is useful of course depends on the dataset.
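The "several seeds" idea could look roughly like the sketch below. It uses plain base R as a stand-in for an ID-only partitioning call, so `split_ids()`, `imbalance()`, the toy data, and the scoring rule are all assumptions made for illustration, not part of groupdata2.

```r
# Hypothetical toy data: every participant has rows from both classes.
toy <- data.frame(
  participant = rep(1:10, each = 3),
  diagnosis   = rep(c("a", "b"), times = 15)
)

# Put a fraction p of the IDs (with all of their rows) in the test set.
# Note: this resets the global RNG seed on every call.
split_ids <- function(data, id_col, p, seed) {
  set.seed(seed)
  ids <- unique(data[[id_col]])
  test_ids <- ids[sample.int(length(ids), size = round(p * length(ids)))]
  data[[id_col]] %in% test_ids
}

# Score a split by how far the test set's class proportions are from the
# proportions in the full data (smaller is better).
imbalance <- function(data, class_col, in_test) {
  classes <- factor(data[[class_col]])
  overall <- prop.table(table(classes))
  test    <- prop.table(table(classes[in_test]))
  sum(abs(overall - test))
}

# Try a handful of seeds and keep the one with the smallest imbalance.
seeds  <- 1:20
scores <- sapply(seeds, function(s) {
  imbalance(toy, "diagnosis", split_ids(toy, "participant", p = 0.2, seed = s))
})
best_seed <- seeds[which.min(scores)]
in_test   <- split_ids(toy, "participant", p = 0.2, seed = best_seed)
```

The absolute-difference score is only one possible choice; any measure of class balance that suits the dataset could be swapped in.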

Alternatively, if it's only two classes, you could make an extra class for studies that have both. So you could make a column new_class with the classes 0, 1, and Both. Then use cat_col = new_class. That should at least make sure that every class is included in each partition, although it won't necessarily be well-balanced.
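A minimal sketch of building such a column in base R, assuming a data frame with hypothetical columns `study` (the ID) and `class` (coded 0/1):

```r
# Hypothetical toy data: study s3 contains both classes.
toy <- data.frame(
  study = c("s1", "s1", "s2", "s2", "s3", "s3"),
  class = c(0, 0, 1, 1, 0, 1)
)

# Count the distinct classes within each study, repeated on every row.
n_classes <- ave(toy$class, toy$study, FUN = function(x) length(unique(x)))

# Studies with more than one class get "Both"; the rest keep their class.
toy$new_class <- ifelse(n_classes > 1, "Both", as.character(toy$class))
print(toy)
```

Because `new_class` is constant within each study, it no longer conflicts with keeping all rows of an ID in the same partition.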

Let me know if those suggestions would work for your project and if you need help implementing them. :)

Best,
Ludvig

from groupdata2.

erflynn commented on June 27, 2024

Thank you! That makes a lot of sense re the implementation in the package.

I have this implemented using tidyverse and it works (included below). It's long and not exactly the same, but it works well enough for me. There is probably a cleaner and faster way, though :).

library(tidyverse)

sep_studies <- function(num_studies, nfolds){
 # separate studies into a certain number of folds
  my_l <- c(1:nfolds, nfolds:1)
  if (runif(1, 0, 1) >= 0.5){ # note that because of this, it can be randomly off by a few samples, and this varies
    my_l <- c(nfolds:1, 1:nfolds)
  }
  if (num_studies < (2*nfolds)){
    return(my_l[1:num_studies])
  } else {
    num_reps <- num_studies %/% (2*nfolds)
    num_rem <- num_studies %% (2*nfolds)
    if (num_rem==0){
      return(rep(my_l, num_reps))
    } else{
      return(c(rep(my_l, num_reps), my_l[1:num_rem]))
    }
  }
}

partition_group_data <- function(df, grp_col="grp", class_col="class", nfolds=2){
  # rename the columns for analysis
  colnames(df)[colnames(df)==grp_col] <- "grp"
  colnames(df)[colnames(df)==class_col] <- "class"

  # get the counts by class in each grp
  study_counts_by_class <- df %>% 
    mutate(grp=as.factor(grp)) %>%
    group_by(grp, class) %>% 
    count() %>% 
    ungroup() %>%
    pivot_wider(names_from=class, values_from=n, names_prefix="num", values_fill=c(n=0)) 
  
  # shuffle and partition
  partitioned_data <- study_counts_by_class %>% 
    group_by_if(is.numeric) %>% 
    sample_n(n()) %>%
    mutate(partition=unlist(sep_studies(n(), nfolds)))  %>%
    ungroup()
  
  # add the sample names back in
  samples_to_grps <- partitioned_data %>% 
    select(grp, partition) %>%
    left_join(df, by=c("grp")) 
  
  return(samples_to_grps)
}


set.seed(104) 
df[3, "diagnosis"] <- "b" # using the same df as before, with the same edit
parts <- partition_group_data(df, grp_col ="participant", class_col="diagnosis", nfolds=2) 
parts %>% group_by(partition, class) %>% count() # varies depending on the iteration, but pretty close
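To see why the assignment in sep_studies() keeps fold sizes close, it may help to look at the pattern without the random flip. The helper below, `snake_folds()`, is a hypothetical name introduced here; it is a deterministic sketch of the first ordering only (1..nfolds followed by nfolds..1, repeated and truncated), not a replacement for the function above.

```r
# Deterministic sketch of the "there and back" fold pattern:
# 1, 2, ..., nfolds, nfolds, ..., 2, 1, recycled to length n.
snake_folds <- function(n, nfolds) {
  rep_len(c(1:nfolds, nfolds:1), n)
}

assignment <- snake_folds(5, 2)
print(assignment)          # 1 2 2 1 1
print(table(assignment))   # fold 1 gets 3 items, fold 2 gets 2
```

Since each pass up and back visits every fold twice, fold sizes can differ only through the truncated final pass, which matches the "off by a few samples" note in the code comment.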

LudvigOlsen commented on June 27, 2024

Just went through your code, and in practice it seems to be in the ballpark of the new_class approach I mentioned (more generalized, though). That seems to be a good approach for your situation!
It's unclear to me whether it's the optimal approach in general, so I will need to work with it a bit, but it's definitely a great starting point for thinking about it! Thanks for sharing :)

The code seems to be for fold(). Do you need a version of this for partition() as well?

erflynn commented on June 27, 2024

OK, no problem! :) This is something I wish had been written when I started the project, so I am sure it will be helpful to others.

I don't. For now I just partition into 5 folds and assign folds 1-4 to training to get an 80/20 split, but it would be a trivial extension: just change sep_studies() to work with a fraction instead of just shuffling IDs.
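As a rough illustration of that workaround, assuming a hypothetical data frame `folds` with one row per group and a `partition` column holding fold labels 1 to 5:

```r
# Hypothetical fold assignment: ten groups spread evenly over five folds.
folds <- data.frame(
  grp       = paste0("g", 1:10),
  partition = rep(1:5, times = 2)
)

# Folds 1-4 become the training set and fold 5 the test set (about 80/20
# by group; the split by row depends on how many rows each group has).
folds$set <- ifelse(folds$partition <= 4, "train", "test")
print(table(folds$set))
```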

LudvigOlsen commented on June 27, 2024

Great :)

If you get any other ideas or find other use cases, do let me know :)

erflynn commented on June 27, 2024

Will do!
