
featureaggregation_single_cell's People

Contributors

dependabot[bot], echterobert


featureaggregation_single_cell's Issues

Deciding on datasets to use

@EchteRobert asked

last week I believe you mentioned that we should perhaps pick a different dataset to start than the Stain5(?) dataset. Do you remember which one(s) you had in mind instead?

@niranjchandrasekaran had said:

Here are some options
Stain2 and Stain3 - Lots of different experimental conditions; 1 plate per condition (no replicates). These two could be good datasets to burn through while Robert is coming up with his methods.
Stain4, Plate1, Reagent1 and Stain5 - Lots of different conditions; 3 or 4 replicate plates per condition. I guess Stain5 is the best dataset, so it could perhaps be used as a holdout set.

Based on this, I vote for starting with Stain2, @EchteRobert

General data analysis

Creating plate clusters for training and validation splits

In order to split all of the Stain2, Stain3, Stain4, and Stain5 (condition C) plates into clusters that are most similar to each other, I created a hierarchical cluster map based on the PC1 loadings of the mean aggregate profiles of these plates, first including all the outliers and then iteratively removing them until I find a set of clusters that are similar enough to be used for training and validation.
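A minimal sketch of this clustering step, assuming one CSV of mean-aggregated profiles per plate and CellProfiler-style feature names (paths, file layout, and the feature-name regex are assumptions, not the project's actual code):

```python
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA

# Plate names as in the cluster tables below; paths are hypothetical.
plates = ["BR00113818", "BR00112198", "BR00112204"]  # etc.

loadings = {}
for plate in plates:
    profiles = pd.read_csv(f"profiles/{plate}.csv")
    # Assumed CellProfiler-style feature columns.
    features = profiles.filter(regex="^(Cells|Cytoplasm|Nuclei)_")
    pca = PCA(n_components=1).fit(features)
    loadings[plate] = pca.components_[0]  # PC1 loadings over features

loadings_df = pd.DataFrame(loadings)      # features x plates
corr = loadings_df.corr()                 # plate-plate Pearson correlation
sns.clustermap(corr, cmap="vlag", vmin=-1, vmax=1)  # hierarchical clustermap
```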

Main takeaways

  • The final 7 clusters were chosen based on the largest clusters that can be seen in the final clustermap iteration.
  • Clusters 5, 6, and 7 are the highest-quality clusters, i.e. they have the highest correlation between the PC1 loadings of their plates.
  • We already know that the model beats the baseline on cluster 1 for most plates, except on BR00113818 and BR00112199. However, note that this is actually one of the most diverse clusters of the 7 I have created here.
Clustermap all plates

[Figure: ClusterMapAllStainPlates]

plate cluster 1: 3 plates
plate cluster 2: 67 plates
plate cluster 3: 9 plates
plate cluster 4: 1 plate

Clustermap remove iteration 3

[Figure: ClusterMapRemove3]

plate cluster 1: 8 plates
plate cluster 2: 8 plates
plate cluster 3: 41 plates
plate cluster 4: 2 plates
plate cluster 5: 9 plates
plate cluster 6: 2 plates
plate cluster 7: 3 plates

Clustermap final iteration

Cluster numbering goes from top to bottom (where the bottom right cluster is number 7)
[Figure: ClusterMapFinal_dataset]

Clusters final iteration
| index | plate | cluster |
| --- | --- | --- |
| 0 | BR00113818 | 1 |
| 1 | BR00112198 | 1 |
| 2 | BR00112204 | 1 |
| 3 | BR00112199 | 1 |
| 4 | BR00112201 | 1 |
| 5 | BR00112197repeat | 1 |
| 6 | BR00112202 | 1 |
| 7 | BR00112197binned | 1 |
| 8 | BR00112197standard | 1 |
| 25 | BR00116621highexp | 2 |
| 29 | BR00116624bin1 | 2 |
| 30 | BR00116624highexp | 2 |
| 31 | BR00116621bin1 | 2 |
| 35 | BR00116620bin1 | 2 |
| 40 | BR00116620highexp | 2 |
| 23 | BR00116632highexp | 3 |
| 24 | BR00116622 | 3 |
| 27 | 015124-V | 3 |
| 33 | BR00116622highexp | 3 |
| 34 | BR00116633highexp | 3 |
| 36 | BR00116634highexp | 3 |
| 39 | 015124-Vhighexp | 3 |
| 9 | BR00115129 | 4 |
| 10 | BR00115128 | 4 |
| 11 | BR00115133highexp | 4 |
| 12 | BR00115133 | 4 |
| 13 | BR00115127 | 4 |
| 14 | BR00115131 | 4 |
| 15 | BR00115125 | 4 |
| 16 | BR00115134 | 4 |
| 17 | BR00115125highexp | 4 |
| 18 | BR00115128highexp | 4 |
| 19 | BR00116630 | 5 |
| 20 | BR00116625 | 5 |
| 21 | BR00116631 | 5 |
| 22 | BR00116627 | 5 |
| 26 | BR00116630highexp | 5 |
| 28 | BR00116629highexp | 5 |
| 32 | BR00116627highexp | 5 |
| 37 | BR00116628highexp | 5 |
| 38 | BR00116625highexp | 5 |
| 41 | BR00116631highexp | 5 |
| 42 | BR00116628 | 5 |
| 43 | BR00116629 | 5 |
| 48 | BR00120277 | 6 |
| 50 | BR00120276 | 6 |
| 51 | BR00120274 | 6 |
| 52 | BR00120275 | 6 |
| 53 | BR00120271 | 6 |
| 54 | BR00120270 | 6 |
| 56 | BR00120272 | 6 |
| 57 | BR00120273 | 6 |
| 44 | BR00120272confocal | 7 |
| 45 | BR00120277confocal | 7 |
| 46 | BR00120274confocal | 7 |
| 47 | BR00120271confocal | 7 |
| 49 | BR00120276confocal | 7 |
| 55 | BR00120273confocal | 7 |
| 58 | BR00120270confocal | 7 |
| 59 | BR00120275confocal | 7 |

03. Model for Stain2

It is now clear that this feature aggregation model will only serve a certain feature set (meaning a certain dataset line); it is not designed to aggregate arbitrary feature sets, only to be invariant to the number of cells per well. I will start by creating a model that beats the 'mean aggregation' baselines of the Stain2 batches, then move on to Stain3 and Stain4, and finally use Stain5 as the final test set.
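The model architecture itself is not reproduced in this issue. As a rough illustration of what "invariant to the number of cells per well" means, here is a minimal DeepSets-style sketch (the dimensions follow hyperparameters mentioned elsewhere in this project, but this is not the project's actual model):

```python
# Minimal sketch of a permutation-invariant well aggregator (DeepSets-style).
# Mean-pooling over cells makes the output independent of the number
# (and order) of cells per well.
import torch
import torch.nn as nn

class SetAggregator(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 2048, output_dim: int = 2048):
        super().__init__()
        self.encoder = nn.Sequential(            # per-cell embedding
            nn.Linear(n_features, latent_dim), nn.ReLU(),
        )
        self.head = nn.Linear(latent_dim, output_dim)  # well-level profile

    def forward(self, cells: torch.Tensor) -> torch.Tensor:
        # cells: (n_cells, n_features); n_cells may differ per well
        pooled = self.encoder(cells).mean(dim=0)       # invariant to n_cells
        return self.head(pooled)

well = torch.randn(1500, 1324)                  # 1500 cells sampled from one well
profile = SetAggregator(n_features=1324)(well)  # (2048,) well profile
```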

Because of that, it would be ideal if all features were the same across the Stain datasets. This is (somewhat) the case across Stain2, Stain3, and Stain4. However, Stain5 used a slightly different CellProfiler pipeline, resulting in a different and larger feature set. During preprocessing I found that the pipeline from raw single-cell features to data that can be fed directly to the model is quite slow, especially when all features are used (4295 for Stain2-4 and 5794 for Stain5). Model inference and training also become increasingly slower as the number of features increases. From the initial experiments on CPJUMP1 we saw that not all features are needed to create a better profile than the baseline (#1). This is why I have chosen to select only the features common to Stain2-5. This has the advantage of speed, in both preprocessing and inference, and of compatibility, as no separate model will have to be trained to use Stain5 as the test set.

Assuming that the features across Stain2, Stain3, Stain4, and Stain5 are consistent within each experiment, there are 1324 features which are measured in all of them. The features are well distributed across categories: Cells: 441 features, Cytoplasm: 433 features, and Nuclei: 450 features. 1124 of them are decently uncorrelated (absolute Pearson correlation < 0.5) [one plate tested]. From here on, these are the features that will be used to train the model.
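A minimal sketch of this feature selection, assuming the per-plate single-cell features are loaded as pandas DataFrames (function names are illustrative):

```python
# Intersect feature names across the Stain batches, then drop one feature
# from every pair whose absolute Pearson correlation exceeds 0.5.
import numpy as np
import pandas as pd

def common_features(frames: list[pd.DataFrame]) -> list[str]:
    cols = set(frames[0].columns)
    for df in frames[1:]:
        cols &= set(df.columns)
    return sorted(cols)

def drop_correlated(df: pd.DataFrame, threshold: float = 0.5) -> list[str]:
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return [c for c in df.columns if c not in to_drop]
```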

General model experiments

This issue is used to test more general aspects of model development not directly related to, but likely still influenced by, the dataset or model hyperparameters that are used at that time.

P2 02. Final model for the LINCS dataset (batch 1)

Here I trained a model on all data available from batch 1 of the LINCS dataset, which can be listed with: `aws s3 ls s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/`
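For reference, the same prefix can be listed from Python, assuming anonymous access to the public cellpainting-gallery bucket (equivalent to the CLI command above):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client for the public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(
    Bucket="cellpainting-gallery",
    Prefix="cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/",
    Delimiter="/",
)
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])  # one entry per plate folder
```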

The model uses 1745 features because of an issue with 10 plates (broadinstitute/lincs-cell-painting#88 (comment)). In total, I trained the model on 136 plates and 5965 wells, including 1228 unique compounds at the 10 µM dose point. During preprocessing I removed 1587 wells due to missing MoA or compound name (pert_iname) annotation. I used the following hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| batch size | 36 |
| epochs | 100 |
| kFilters | 0.5 |
| latent dim | 2048 |
| learning rate | 0.0005 |
| nr cells | (1500, 800) |
| nr sets | 8 |
| optimizer | AdamW |
| output dim | 2048 |
| true batch size | 288 |
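Note that the true batch size is consistent with batch size × nr sets (36 × 8 = 288), suggesting each well in a batch is expanded into several randomly sampled cell sets. A purely hypothetical sketch of such sampling (names and shapes are assumptions, not the project's actual code):

```python
import torch

def sample_sets(cells: torch.Tensor, nr_sets: int = 8, n_cells: int = 1500) -> torch.Tensor:
    # Draw `nr_sets` random subsets of `n_cells` cells (with replacement),
    # so one well contributes `nr_sets` entries to the effective batch:
    # batch_size * nr_sets = 36 * 8 = 288.
    idx = torch.stack([
        torch.randint(len(cells), (n_cells,)) for _ in range(nr_sets)
    ])
    return cells[idx]  # (nr_sets, n_cells, n_features)
```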

I assess the model on the 10 µM dose point using replicate and MoA prediction, and similarly on the 3.33 µM dose point, which is considered the test set.
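The exact evaluation code is not reproduced here; below is a minimal sketch of what replicate-prediction mAP could look like, assuming cosine-similarity ranking with a compound's replicate wells as the positives (an assumption about the metric, not the project's implementation):

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.metrics.pairwise import cosine_similarity

def replicate_map(profiles: np.ndarray, compound_ids: np.ndarray) -> float:
    # Rank all other wells by similarity to each query well and compute
    # average precision with that compound's replicates as positives.
    sim = cosine_similarity(profiles)
    aps = []
    for i in range(len(profiles)):
        mask = np.arange(len(profiles)) != i            # exclude the query
        labels = (compound_ids[mask] == compound_ids[i]).astype(int)
        if labels.sum() == 0:
            continue                                    # no replicates to find
        aps.append(average_precision_score(labels, sim[i, mask]))
    return float(np.mean(aps))
```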

Results

  • The model significantly improves upon the average baseline for replicate and MoA prediction for both the 10 and 3.33 uM dose points.
  • It improves the MoA-prediction mAP by 60% and 30% for the 10 µM dose point (training) and 3.33 µM dose point (test) data, respectively (0.0541 vs. 0.0338 and 0.042 vs. 0.0322).
  • I could have trained the model a bit longer, e.g. 150 epochs, as the validation mAP and loss had not yet fully converged.
Results 10 uM dose point

Replicate prediction
Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=84.81208433212997, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.7473 | 0.269 | 0 |

MoA prediction

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=6.753694914168434, pvalue=1.5518902810751288e-11)

| plate | mAP model | mAP BM | mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.0541 | 0.0338 | 0.0002 |
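The Ttest_indResult lines above are SciPy output; Welch's variant (unequal variances) is `ttest_ind` with `equal_var=False`. A sketch with placeholder arrays standing in for the per-sample mAP values:

```python
import numpy as np
from scipy import stats

mlp_map_values = np.random.rand(500)  # placeholder: per-sample mAP, model
bm_map_values = np.random.rand(500)   # placeholder: per-sample mAP, baseline
result = stats.ttest_ind(mlp_map_values, bm_map_values, equal_var=False)
print(result)  # Ttest_indResult(statistic=..., pvalue=...)
```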
Results 3.33 uM dose point

Replicate prediction
Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=49.02599189522616, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.4465 | 0.1695 | 0 |

MoA prediction
Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=3.525483296865904, pvalue=0.0004250301209859708)

| plate | mAP model | mAP BM | mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.042 | 0.0322 | 0 |
Loss curves

[Figures: training and validation loss curves (screenshots, 2022-10-05)]
All plate names SQ00014812_SQ00014813_SQ00014814_SQ00014815_SQ00014816_SQ00014817_SQ00014818_SQ00014819_SQ00014820_SQ00015041_SQ00015042_SQ00015043_SQ00015044_SQ00015045_SQ00015046_SQ00015047_SQ00015048_SQ00015049_SQ00015050_SQ00015051_SQ00015052_SQ00015053_SQ00015054_SQ00015055_SQ00015056_SQ00015057_SQ00015058_SQ00015059_SQ00015096_SQ00015097_SQ00015098_SQ00015099_SQ00015100_SQ00015101_SQ00015102_SQ00015103_SQ00015105_SQ00015106_SQ00015107_SQ00015108_SQ00015109_SQ00015110_SQ00015111_SQ00015112_SQ00015116_SQ00015117_SQ00015118_SQ00015119_SQ00015120_SQ00015121_SQ00015122_SQ00015123_SQ00015124_SQ00015125_SQ00015126_SQ00015127_SQ00015128_SQ00015129_SQ00015130_SQ00015131_SQ00015132_SQ00015133_SQ00015134_SQ00015135_SQ00015136_SQ00015137_SQ00015138_SQ00015139_SQ00015140_SQ00015141_SQ00015142_SQ00015143_SQ00015144_SQ00015145_SQ00015146_SQ00015147_SQ00015148_SQ00015149_SQ00015150_SQ00015151_SQ00015152_SQ00015153_SQ00015154_SQ00015155_SQ00015156_SQ00015157_SQ00015158_SQ00015159_SQ00015160_SQ00015162_SQ00015163_SQ00015164_SQ00015165_SQ00015166_SQ00015167_SQ00015168_SQ00015169_SQ00015170_SQ00015171_SQ00015172_SQ00015173_SQ00015194_SQ00015195_SQ00015196_SQ00015197_SQ00015198_SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015229_SQ00015230_SQ00015231_SQ00015232_SQ00015233

07. Model for Stain5

In this issue I will post all results on Stain5, with various models and evaluation metrics. I first trained a model on 5 plates each from Stain2, Stain3, and Stain4, for a total of 15 training plates. I then used this model to run inference directly on Stain5. After that, I fine-tuned the model (transfer learning) using 3 plates from Stain5 and 10, 20, 40, and 80% of the available compounds for training/fine-tuning. I also tested this with 1 plate and the same fractions, but multiple plates turned out to be required to generalize to the feature patterns of Stain5.
I fine-tune the model by training for 100 epochs and then taking the model with the best validation mAP. There are perhaps better ways to do this, but this is a first proof-of-concept experiment.
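A hedged sketch of this fine-tuning protocol (the helper functions, data loaders, and checkpoint path are placeholders for project code, not its actual implementation; the learning rate follows the hyperparameter table above):

```python
import copy
import torch

# Placeholders: `model`, `finetune_loader`, `validation_loader`,
# `train_one_epoch`, and `evaluate_map` are assumed to exist.
model.load_state_dict(torch.load("stain234_pretrained.pt"))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

best_map, best_state = 0.0, None
for epoch in range(100):
    train_one_epoch(model, optimizer, finetune_loader)
    val_map = evaluate_map(model, validation_loader)
    if val_map > best_map:                       # keep the best-validation-mAP model
        best_map = val_map
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)                # final fine-tuned model
```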

Main takeaways

  • As expected, the model does not directly translate to Stain5. This is not due to plate/batch effects, as those are much smaller. Instead, it is due to different experimental conditions, which cause a shift in the feature distributions so that the model is no longer able to correctly aggregate the single-cell data. See #7 (comment) for the hierarchical cluster map showing that Stain2, Stain3, and Stain4 fall in a different cluster than Stain5 altogether.
  • Secondly, fine-tuning the model does increase performance on plates that are similar to the fine-tuning data (i.e. fine-tuning on confocal plates increases performance on confocal plates and fine-tuning on widefield plates increases performance on widefield plates, but not both at the same time). However, in order to generalize to unseen compounds I need to use more than 3 plates. This is the same issue I ran into before when training the models: 1 training plate does not generalize to unseen plates and 3 training plates do not generalize to unseen compounds. So I probably need 5 or more plates to generalize to this type of data.

Results

Out of distribution model
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM | Batch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BR00120530 | 0.26 | 0.28 | 0.23 | 0.39 | 58.9 | 58.9 | CondA PE |
| BR00120530confocal | 0.03 | 0.29 | 0.03 | 0.4 | 5.6 | 56.7 | CondA PE |
| BR00120526confocal | 0.03 | 0.29 | 0.02 | 0.36 | 3.3 | 58.9 | CondA Thermo |
| BR00120526 | 0.35 | 0.28 | 0.38 | 0.37 | 72.2 | 56.7 | CondA Thermo |
| BR00120536confocal | 0.06 | 0.25 | 0.05 | 0.37 | 17.8 | 55.6 | CondB PE |
| BR00120536 | 0.02 | 0.25 | 0.03 | 0.35 | 4.4 | 54.4 | CondB PE |
| BR00120532confocal | 0.05 | 0.24 | 0.06 | 0.35 | 8.9 | 50 | CondB Thermo |
| BR00120532 | 0.15 | 0.23 | 0.18 | 0.38 | 36.7 | 50 | CondB Thermo |
| BR00120274 | 0.17 | 0.23 | 0.21 | 0.34 | 31.1 | 54.4 | CondC PE |
| BR00120274confocal | 0.03 | 0.21 | 0.03 | 0.36 | 4.4 | 52.2 | CondC PE |
| BR00120270 | 0.25 | 0.26 | 0.29 | 0.38 | 55.6 | 48.9 | CondC Thermo |
| BR00120270confocal | 0.02 | 0.26 | 0.03 | 0.39 | 2.2 | 52.2 | CondC Thermo |
Fine-tuned model 10%
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM | Batch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BR00120530 | 0.22 | 0.28 | 0.2 | 0.39 | 43.3 | 58.9 | CondA PE |
| BR00120530confocal | 0.03 | 0.29 | 0.03 | 0.4 | 5.6 | 56.7 | CondA PE |
| BR00120526confocal | 0.03 | 0.29 | 0.02 | 0.36 | 3.3 | 58.9 | CondA Thermo |
| BR00120526 | 0.3 | 0.28 | 0.36 | 0.37 | 53.3 | 56.7 | CondA Thermo |
| BR00120536confocal | 0.06 | 0.25 | 0.05 | 0.37 | 10 | 55.6 | CondB PE |
| BR00120536 | 0.03 | 0.25 | 0.04 | 0.35 | 5.6 | 54.4 | CondB PE |
| BR00120532confocal | 0.05 | 0.24 | 0.07 | 0.35 | 11.1 | 50 | CondB Thermo |
| BR00120532 | 0.13 | 0.23 | 0.18 | 0.38 | 27.8 | 50 | CondB Thermo |
| BR00120274 | 0.16 | 0.23 | 0.2 | 0.34 | 27.8 | 54.4 | CondC PE |
| BR00120274confocal | 0.03 | 0.21 | 0.04 | 0.36 | 5.6 | 52.2 | CondC PE |
| BR00120270 | 0.23 | 0.26 | 0.28 | 0.38 | 42.2 | 48.9 | CondC Thermo |
| BR00120270confocal | 0.03 | 0.26 | 0.03 | 0.39 | 4.4 | 52.2 | CondC Thermo |
Fine-tuned model 20%
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM | Batch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BR00120530 | 0.22 | 0.28 | 0.23 | 0.39 | 57.8 | 58.9 | CondA PE |
| BR00120530confocal | 0.03 | 0.29 | 0.02 | 0.4 | 5.6 | 56.7 | CondA PE |
| BR00120526confocal | 0.02 | 0.29 | 0.02 | 0.36 | 2.2 | 58.9 | CondA Thermo |
| BR00120526 | 0.32 | 0.28 | 0.31 | 0.37 | 75.6 | 56.7 | CondA Thermo |
| BR00120536confocal | 0.05 | 0.25 | 0.05 | 0.37 | 17.8 | 55.6 | CondB PE |
| BR00120536 | 0.09 | 0.25 | 0.03 | 0.35 | 23.3 | 54.4 | CondB PE |
| BR00120532confocal | 0.05 | 0.24 | 0.06 | 0.35 | 10 | 50 | CondB Thermo |
| BR00120532 | 0.27 | 0.23 | 0.21 | 0.38 | 61.1 | 50 | CondB Thermo |
| BR00120274 | 0.22 | 0.23 | 0.21 | 0.34 | 53.3 | 54.4 | CondC PE |
| BR00120274confocal | 0.03 | 0.21 | 0.04 | 0.36 | 8.9 | 52.2 | CondC PE |
| BR00120270 | 0.33 | 0.26 | 0.34 | 0.38 | 75.6 | 48.9 | CondC Thermo |
| BR00120270confocal | 0.03 | 0.26 | 0.04 | 0.39 | 3.3 | 52.2 | CondC Thermo |
Fine-tuned model 40%
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM | Batch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Fine-tune plates** | | | | | | | |
| BR00120526 | 0.35 | 0.28 | 0.34 | 0.37 | 78.9 | 56.7 | CondA Thermo |
| BR00120532 | 0.27 | 0.23 | 0.21 | 0.38 | 57.8 | 50 | CondB Thermo |
| BR00120270 | 0.36 | 0.26 | 0.33 | 0.38 | 78.9 | 48.9 | CondC Thermo |
| **Hold-out plates** | | | | | | | |
| BR00120530 | 0.26 | 0.28 | 0.22 | 0.39 | 65.6 | 58.9 | CondA PE |
| BR00120530confocal | 0.03 | 0.29 | 0.02 | 0.4 | 3.3 | 56.7 | CondA PE |
| BR00120526confocal | 0.02 | 0.29 | 0.03 | 0.36 | 3.3 | 58.9 | CondA Thermo |
| BR00120536confocal | 0.05 | 0.25 | 0.04 | 0.37 | 17.8 | 55.6 | CondB PE |
| BR00120536 | 0.06 | 0.25 | 0.04 | 0.35 | 14.4 | 54.4 | CondB PE |
| BR00120532confocal | 0.05 | 0.24 | 0.06 | 0.35 | 15.6 | 50 | CondB Thermo |
| BR00120274 | 0.25 | 0.23 | 0.22 | 0.34 | 54.4 | 54.4 | CondC PE |
| BR00120274confocal | 0.03 | 0.21 | 0.03 | 0.36 | 4.4 | 52.2 | CondC PE |
| BR00120270confocal | 0.03 | 0.26 | 0.03 | 0.39 | 6.7 | 52.2 | CondC Thermo |
Fine-tuned model 80%
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM | Batch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Fine-tune plates** | | | | | | | |
| BR00120526 | 0.37 | 0.28 | 0.39 | 0.37 | 85.6 | 56.7 | CondA Thermo |
| BR00120532 | 0.3 | 0.23 | 0.26 | 0.38 | 78.9 | 50 | CondB Thermo |
| BR00120270 | 0.39 | 0.26 | 0.36 | 0.38 | 87.8 | 48.9 | CondC Thermo |
| **Hold-out plates** | | | | | | | |
| BR00120530 | 0.25 | 0.28 | 0.23 | 0.39 | 63.3 | 58.9 | CondA PE |
| BR00120530confocal | 0.03 | 0.29 | 0.03 | 0.4 | 8.9 | 56.7 | CondA PE |
| BR00120526confocal | 0.03 | 0.29 | 0.03 | 0.36 | 2.2 | 58.9 | CondA Thermo |
| BR00120536confocal | 0.06 | 0.25 | 0.06 | 0.37 | 23.3 | 55.6 | CondB PE |
| BR00120536 | 0.05 | 0.25 | 0.03 | 0.35 | 26.7 | 54.4 | CondB PE |
| BR00120532confocal | 0.05 | 0.24 | 0.08 | 0.35 | 17.8 | 50 | CondB Thermo |
| BR00120274 | 0.26 | 0.23 | 0.26 | 0.34 | 60 | 54.4 | CondC PE |
| BR00120274confocal | 0.03 | 0.21 | 0.04 | 0.36 | 3.3 | 52.2 | CondC PE |
| BR00120270confocal | 0.03 | 0.26 | 0.03 | 0.39 | 8.9 | 52.2 | CondC Thermo |

XX. Possible future experiments

In this issue, I will outline experiments that may be useful in the future to further investigate the inner workings of the model, but for which I currently do not have time (or which have low priority).

06. Final model iterations

Two cluster training data (T: S3+S4)

Some final tweaks to training the model will be made in this issue. All of these tweaks will be made with Stain2, Stain3, and Stain4 in mind at the same time, instead of one at a time. The first model is trained on 3 plates each from Stain3 and Stain4 at the same time and evaluated on Stain2, Stain3, and Stain4.

Main takeaways

  • It's possible to generalize to clusters outside of the trained clusters by using training data from at least two clusters at the same time.
  • This actually also improves overall validation mAP for plates within the training clusters, although it is slightly worse than the best model trained on the Stain2 cluster specifically.
Table Stain4
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
| --- | --- | --- | --- | --- | --- | --- |
| **Training plates** | | | | | | |
| BR00116625highexp | 0.74 | 0.32 | 0.36 | 0.28 | 98.9 | 61.1 |
| BR00116628highexp | 0.73 | 0.32 | 0.32 | 0.31 | 98.9 | 57.8 |
| BR00116629highexp | 0.78 | 0.29 | 0.35 | 0.29 | 100 | 52.2 |
| **Validation plates** | | | | | | |
| BR00116631highexp | 0.47 | 0.28 | 0.27 | 0.3 | 93.3 | 53.3 |
| BR00116625 | 0.6 | 0.31 | 0.35 | 0.29 | 98.9 | 58.9 |
| BR00116630highexp | 0.52 | 0.29 | 0.3 | 0.3 | 97.8 | 58.9 |
| BR00116631 | 0.5 | 0.3 | 0.26 | 0.28 | 94.4 | 57.8 |
| BR00116627highexp | 0.55 | 0.31 | 0.38 | 0.27 | 98.9 | 56.7 |
| BR00116627 | 0.55 | 0.3 | 0.36 | 0.29 | 96.7 | 56.7 |
| BR00116629 | 0.61 | 0.3 | 0.32 | 0.29 | 98.9 | 52.2 |
| BR00116628 | 0.58 | 0.32 | 0.28 | 0.29 | 98.9 | 58.9 |
Table Stain3
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
| --- | --- | --- | --- | --- | --- | --- |
| **Training plates** | | | | | | |
| BR00115134 | 0.75 | 0.37 | 0.42 | 0.33 | 98.9 | 58.9 |
| BR00115125 | 0.75 | 0.36 | 0.44 | 0.29 | 98.9 | 54.4 |
| BR00115133highexp | 0.76 | 0.38 | 0.38 | 0.31 | 97.8 | 60 |
| **Validation plates** | | | | | | |
| BR00115128highexp | 0.52 | 0.4 | 0.42 | 0.33 | 97.8 | 58.9 |
| BR00115125highexp | 0.58 | 0.37 | 0.41 | 0.31 | 98.9 | 55.6 |
| BR00115131 | 0.54 | 0.38 | 0.44 | 0.29 | 98.9 | 58.9 |
| BR00115126 | 0.34 | 0.32 | 0.33 | 0.28 | 57.8 | 53.3 |
| BR00115133 | 0.58 | 0.38 | 0.4 | 0.3 | 96.7 | 62.2 |
| BR00115127 | 0.56 | 0.38 | 0.47 | 0.31 | 98.9 | 58.9 |
| BR00115128 | 0.53 | 0.39 | 0.42 | 0.32 | 96.7 | 61.1 |
| BR00115129 | 0.57 | 0.38 | 0.45 | 0.32 | 98.9 | 52.2 |
Table Stain2
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
| --- | --- | --- | --- | --- | --- | --- |
| BR00112202 | 0.43 | 0.34 | 0.38 | 0.3 | 88.9 | 54.4 |
| BR00112197standard | 0.45 | 0.4 | 0.41 | 0.28 | 85.6 | 56.7 |
| BR00112198 | 0.43 | 0.35 | 0.4 | 0.3 | 91.1 | 56.7 |
| BR00112197repeat | 0.43 | 0.41 | 0.37 | 0.31 | 81.1 | 63.3 |
| BR00112204 | 0.4 | 0.35 | 0.46 | 0.29 | 82.2 | 58.9 |
| BR00112197binned | 0.43 | 0.41 | 0.39 | 0.3 | 86.7 | 58.9 |
| BR00112201 | 0.47 | 0.4 | 0.41 | 0.32 | 91.1 | 66.7 |

04. Model for Stain3

To test the generalization of the model trained on Stain2, I will now evaluate it on Stain3. Based on the results, further advancements will be made by training on plates of Stain3 (and possibly then evaluating on Stain2 in turn).

Stain3 consists of 17 plates, divided into 4 "batches" defined by the analysis pipeline used:

  • Standard
  • Multiplane
  • HighExp
  • Bin1

To analyze the relations between the different plates in Stain3, I calculated the correlation between the PC1 loadings of the mean aggregated profiles of every plate. The BR00115130 plate stands out. This agrees with the findings in https://github.com/jump-cellpainting/pilot-analysis/issues/20, where this plate achieved the lowest PR, likely due to many wells having evaporated or some other problem. Another plate that stands out is BR00115132, which contains a 4-fold dilution of all dyes. BR00115130 will be left out of the analysis altogether; BR00115132 will be kept, although it will not be used as a training plate.

[Figure: PC1loadings_meanProfiles]

Number of cells per well per plate

[Figure: NRCELLS_dist]

05. Model for Stain4

To test the generalization of the model trained on Stain3 (and tested on Stain2), I will now evaluate it on Stain4. Based on the results, further advancements will be made by training on plates of Stain4 (and then evaluating on Stain2 and Stain3 in turn).

Stain4 consists of 30 plates, divided into 5 batches, each with different staining conditions:

  • Baseline staining conditions used in Stain3 (Stain2)
  • Baseline staining conditions used in Stain3 with 2-fold dilution of all dyes (Stain2_2)
  • Baseline staining conditions used in Stain3 with 2-fold dilution of Phalloidin (Stain2_Phalloidin_2)
  • Baseline staining conditions used in Stain3 with 2-fold dilution of Phalloidin and ConA (Stain2_Phalloidin_ConA_2)
  • Bray et al. staining conditions (Bray)

Apart from that, standard exposure vs. high exposure and Binning 1 vs. Binning 2 comparisons were also made.

To analyze the relations between the different plates in Stain4, I calculated the correlation between the PC1 loadings of the mean aggregated profiles of every plate. I only included the plates that were similar enough to form a large cluster.

Clusters: [Figure: PlateClustermap]

Cells per well per plate (only the plates I have downloaded so far, but these still give a good indication of the dataset):
[Figure: NRCELLS_Stain4]

P2 01. Testing the method on subsets of the LINCS dataset

LINCS contains 6 dose points: 0.04 µM, 0.12 µM, 0.37 µM, 1.11 µM, 3.33 µM, and 10 µM. For my experiments, I will use the highest dose (10 µM) as the training set and the validation set. The model is trained to create profiles that attract replicate compound profiles and repel non-replicate compound profiles. It is then validated by evaluating the ability of these profiles to predict MoAs (or find sister compounds). Finally, the model will be tested on the 3.33 µM dose point data as a hold-out set. This data should look significantly different from the training and validation data.

I will follow the same data exclusion protocol as Michael did in his research:

“The evaluation metrics all require the number of compounds in an MOA to be at least two (MOA class size of at least two). A compound that does not have other compounds annotated with the same MOA cannot have a precision value and does not affect Enrichment and Hit@k. Additionally, there is no use for compounds with unknown MOA labels, because I can not apply any metrics. Finally, I used only a single dose per compound to make experimental iterations practical. I picked the 10uM dose point because we found that this dose produces the strongest phenotypes across compounds without reducing the ability to group MOAs [42]. These three requirements lead to a subset of the LINCS data in which I 1) only keep the highest dose of each perturbation, 2) delete all compounds that have unknown MOA labels, and 3) drop all compounds with MOA sizes smaller than two. All further processing and experiment are based on this subset of data. This subset process is documented in my repository.”

Which should result in similar data numbers:

“The resulting (sub-selected) dataset comprises 8,818 wells spanning over 136 plates, five different batches, and 18.2 million cells. Each of the 1,144 perturbations (compounds) have five technical replicates - each on a separate plate but in the same position on the plate (same well position). Four of these technical replicate wells were produced in the same batch. This is highly relevant for training and technical artifacts, because I expect cells in the same well and same batch to have similar technical artifacts, but the corresponding wells from a different batch may look very different from one another. Each of the 136 plates has 24 negative controls (DMSO) distributed across the plate (3264 DMSOs in total). The 1,144 compounds can be categorized into 235 MOAs, and every MOA has at least two compounds (MOA class size > 1). Some MOAs, such as phosphodiesterase inhibitor, have an MOA class size of 35, while most other MOAs are only found for two compounds.”
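A minimal sketch of the three exclusion rules from the quoted protocol, assuming hypothetical metadata column names (Metadata_dose, Metadata_moa, Metadata_pert_iname); the actual column names may differ:

```python
import pandas as pd

def subset_lincs(meta: pd.DataFrame) -> pd.DataFrame:
    meta = meta[meta["Metadata_dose"] == 10.0]            # 1) highest dose only
    meta = meta.dropna(subset=["Metadata_moa"])           # 2) known MoA only
    # 3) keep only MoAs annotated for at least two compounds
    moa_sizes = meta.groupby("Metadata_moa")["Metadata_pert_iname"].nunique()
    valid = moa_sizes[moa_sizes >= 2].index
    return meta[meta["Metadata_moa"].isin(valid)]
```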

I have put some relevant quotes from Michael's work and the LINCS manuscript here: https://docs.google.com/document/d/1z2U5o91vzBwB-4xtryYn5d3kSWJi8ZSerE_MAzdLT_0/edit#heading=h.sbi6l2r6p5ec

02. Model for CPJUMP1 + Stain2

The first results here were generated by the last model presented (BS64) in #1. Note that the Stain2 confocal data contains 1433 features instead of the 1938 in CPJUMP1, so some features are randomly selected and repeated to make the data fit the model's expected input size.
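A minimal sketch of this select-and-repeat padding (the function, seed, and shapes are illustrative, not the project's actual code):

```python
import numpy as np

def pad_features(x: np.ndarray, target: int = 1938, seed: int = 0) -> np.ndarray:
    # Randomly pick existing feature columns and repeat them until the
    # feature dimension matches what the model expects.
    rng = np.random.default_rng(seed)
    extra = rng.choice(x.shape[1], size=target - x.shape[1], replace=True)
    return np.concatenate([x, x[:, extra]], axis=1)

cells = np.random.rand(1500, 1433)  # placeholder Stain2 single-cell features
padded = pad_features(cells)        # (1500, 1938), matches the CPJUMP1 model
```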

Main takeaways

As expected, the model does not directly translate to the Stain2 dataset. I expect the cause to be two-fold:

  1. Stain2 measured fewer, and perhaps even different, features; the model has likely not learned to interpret general feature distributions yet, but has instead overfit to the features present in CPJUMP1.
  2. Stain2 contains different compounds, which produce different feature distributions than the compounds found in CPJUMP1.

Next steps: find a way for the model to generalize to different features and compounds.

Model trained on CPJUMP1 compound data on Stain2_Batch2_Binned PR // BENCHMARK: 0.6 PS
[Figure: MLP_CPJUMP1_BS64_PR]

Model trained on CPJUMP1 compound data on Stain2_Batch2_Confocal PR // BENCHMARK: 0.533
[Figure: MLP_CPJUMP1_BS64_Confocal_PR]
