Comments (6)
Hi Emily,
You are right that there are times when the same ID could reasonably have two different classes, and it's definitely something I should look into supporting in the future. Right now, though, the implementation doesn't make it possible:
With categorical balancing: 1) Split the dataset by the cat_col column. 2) Partition each split. 3) Combine the partitions (first partition from split 1 with first partition from split 2, etc.).
With id_col: 1) Extract the unique IDs. 2) Partition them. 3) Put all rows for the ID in its partition.
With both: 1) Extract the unique IDs with their classes. 2) Do the categorical balancing. 3) Put all rows for the ID in its partition.
So, in the "Both" version, if we had multiple classes within an ID, the approach wouldn't work: I would either have multiple rows per ID (which may end up in different partitions) or only consider one of the classes for each ID (which is what would happen in the current code, I believe).
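To make the id_col steps concrete, here is a minimal base R sketch of the idea. It is not the groupdata2 implementation; `partition_by_id()` and the toy data are hypothetical names made up for illustration.

```r
# Hypothetical helper (not part of groupdata2): assign every row of an ID
# to the same partition.
partition_by_id <- function(df, id_col, n_parts = 2) {
  ids <- unique(df[[id_col]])                      # 1) extract the unique IDs
  ids <- ids[sample.int(length(ids))]              #    shuffle them
  part <- rep_len(seq_len(n_parts), length(ids))   # 2) partition the IDs
  df$partition <- part[match(df[[id_col]], ids)]   # 3) rows follow their ID
  df
}

set.seed(1)
# Toy data (assumed): 6 participants with 2 rows each
toy <- data.frame(
  participant = rep(1:6, each = 2),
  diagnosis = rep(c("a", "b"), times = c(8, 4))
)
toy_parts <- partition_by_id(toy, "participant", n_parts = 2)

# Number of distinct partitions each participant ended up in (should all be 1)
parts_per_id <- tapply(
  toy_parts$partition, toy_parts$participant,
  function(p) length(unique(p))
)
```

Because the partition is decided per ID and then copied to the rows, an ID can never be split across partitions, which is also why a per-row class cannot be balanced at the same time.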
It's very possible that I can think of a way to solve this when it's not 3 am :)
For now, I'd suggest that you only use the id_col, and perhaps run the partitioning a couple of times with different seeds and use the one that has the best distribution of the classes. Whether this is useful of course depends on the dataset.
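A rough sketch of the "try several seeds and keep the best" idea, in base R. The helpers `assign_ids()` and `class_imbalance()`, the toy data, and the imbalance score (largest deviation of a partition's class proportions from the overall proportions) are all assumptions for illustration; any other balance measure could be swapped in.

```r
# Hypothetical helper: ID-level partitioning (all rows of an ID stay together)
assign_ids <- function(df, id_col, n_parts) {
  ids <- unique(df[[id_col]])
  ids <- ids[sample.int(length(ids))]
  part <- rep_len(seq_len(n_parts), length(ids))
  df$partition <- part[match(df[[id_col]], ids)]
  df
}

# Hypothetical score: 0 means every partition matches the overall class mix
class_imbalance <- function(df, class_col) {
  tab <- table(df$partition, df[[class_col]])
  props <- prop.table(tab, margin = 1)   # class proportions per partition
  overall <- prop.table(colSums(tab))    # class proportions overall
  max(abs(sweep(props, 2, overall)))
}

# Toy data (assumed): 12 participants with 3 rows each
toy <- data.frame(
  participant = rep(1:12, each = 3),
  diagnosis = rep(c("a", "b"), times = c(21, 15))
)
toy$diagnosis[1] <- "b"  # one participant with both classes

seeds <- 1:20
scores <- vapply(seeds, function(s) {
  set.seed(s)
  class_imbalance(assign_ids(toy, "participant", 2), "diagnosis")
}, numeric(1))

# Re-run with the best seed to recover that partitioning
best_seed <- seeds[which.min(scores)]
set.seed(best_seed)
best <- assign_ids(toy, "participant", 2)
```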
Alternatively, if it's only two classes, you could make an extra class for studies that have both. So you could make a column new_class with the classes 0, 1, and Both. Then use cat_col = new_class. That should at least make sure that every class is included in each partition, although it won't necessarily be well-balanced.
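One way the new_class column could be built, sketched in base R. The toy data and column names (`participant`, `diagnosis`) are assumptions; the commented-out `partition()` call at the end shows how the column might then be passed to groupdata2 and is not run here.

```r
# Toy data (assumed): participant 3 has rows from both classes
toy <- data.frame(
  participant = c(1, 1, 2, 2, 3, 3, 4, 4),
  diagnosis = c("0", "0", "1", "1", "0", "1", "1", "1"),
  stringsAsFactors = FALSE
)

# One label per participant: its single class, or "Both" if it has two
label_per_id <- tapply(toy$diagnosis, toy$participant, function(x) {
  u <- unique(x)
  if (length(u) > 1) "Both" else u
})

# Copy the ID-level label back onto every row of that participant
toy$new_class <- as.vector(label_per_id[as.character(toy$participant)])

# Possible next step (not run here):
# groupdata2::partition(toy, p = 0.5, cat_col = "new_class", id_col = "participant")
```

Since new_class is constant within each ID, it satisfies the "one class per ID" requirement described above.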
Let me know if those suggestions would work for your project and if you need help implementing them. :)
Best,
Ludvig
from groupdata2.
Thank you! That makes a lot of sense re the implementation in the package.
I have this implemented using tidyverse and it works (included below). It's long and not exactly the same, but it works well enough for me. There is probably a cleaner and faster way, though :).
library(tidyverse)

sep_studies <- function(num_studies, nfolds) {
  # Separate studies into a certain number of folds by walking
  # back and forth over the fold numbers (1..nfolds, nfolds..1)
  my_l <- c(1:nfolds, nfolds:1)
  # Randomly flip the starting direction. Because of this, fold sizes can be
  # randomly off by a few samples, and this varies between runs.
  if (runif(1, 0, 1) >= 0.5) {
    my_l <- c(nfolds:1, 1:nfolds)
  }
  if (num_studies < (2 * nfolds)) {
    return(my_l[seq_len(num_studies)])
  } else {
    num_reps <- num_studies %/% (2 * nfolds)
    num_rem <- num_studies %% (2 * nfolds)
    if (num_rem == 0) {
      return(rep(my_l, num_reps))
    } else {
      return(c(rep(my_l, num_reps), my_l[seq_len(num_rem)]))
    }
  }
}

partition_group_data <- function(df, grp_col = "grp", class_col = "class", nfolds = 2) {
  # Rename the columns for analysis
  colnames(df)[colnames(df) == grp_col] <- "grp"
  colnames(df)[colnames(df) == class_col] <- "class"

  # Get the counts by class in each grp (one row per grp, one column per class)
  study_counts_by_class <- df %>%
    mutate(grp = as.factor(grp)) %>%
    group_by(grp, class) %>%
    count() %>%
    ungroup() %>%
    pivot_wider(names_from = class, values_from = n, names_prefix = "num", values_fill = list(n = 0))

  # Shuffle and partition within each set of grps that share the same class counts
  partitioned_data <- study_counts_by_class %>%
    group_by_if(is.numeric) %>%
    sample_n(n()) %>%
    mutate(partition = unlist(sep_studies(n(), nfolds))) %>%
    ungroup()

  # Add the sample names back in
  samples_to_grps <- partitioned_data %>%
    select(grp, partition) %>%
    left_join(df, by = c("grp"))
  return(samples_to_grps)
}

set.seed(104)
df[3, "diagnosis"] <- "b" # using the same df as before, with the same edit
parts <- partition_group_data(df, grp_col = "participant", class_col = "diagnosis", nfolds = 2)
parts %>% group_by(partition, class) %>% count() # varies depending on the iteration, but pretty close
from groupdata2.
Just went through your code, and in practice it seems to be in the ballpark of the new_class approach I mentioned (more generalized though). That seems to be a good approach for your situation!
It's unclear to me whether it's the optimal approach in general, so I will need to work with it a bit, but it's definitely a great starting point for thinking about it! Thanks for sharing :)
The code seems to be for fold(). Do you need a version of this for partition() as well?
from groupdata2.
ok! no problem! :) This is something I wish was written when I started the project, so I am sure it will be helpful to others.
I don't. For now I just partition into 5 folds and then assign folds 1-4 to training to get an 80/20 split. It would be a trivial extension, though: just change sep_studies() to work with a fraction instead of just shuffling IDs.
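For reference, the 5-folds-to-80/20 step can be sketched like this in base R. The `fold_assignments` data frame is a hypothetical stand-in for the output of a partitioning function that adds a `partition` column with values 1 to 5.

```r
# Toy stand-in (assumed) for a partitioned result: 10 groups over 5 folds
fold_assignments <- data.frame(
  grp = 1:10,
  partition = rep(1:5, times = 2)
)

# Folds 1-4 become the training set, fold 5 the test set
fold_assignments$set <- ifelse(fold_assignments$partition %in% 1:4, "train", "test")

train <- fold_assignments[fold_assignments$set == "train", ]
test <- fold_assignments[fold_assignments$set == "test", ]
```

The split is 80/20 in terms of groups only when the folds hold equal numbers of groups; in terms of rows it also depends on how many rows each group has.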
from groupdata2.
Great :)
If you get any other ideas or find other use cases, do let me know :)
from groupdata2.
will do!
from groupdata2.