
featureaggregation_single_cell's People

Contributors

dependabot[bot], echterobert


featureaggregation_single_cell's Issues

Deciding on datasets to use

@EchteRobert asked

last week I believe you mentioned that we should perhaps pick a different dataset to start than the Stain5(?) dataset. Do you remember which one(s) you had in mind instead?

@niranjchandrasekaran had said:

Here are some options
Stain2 and Stain3 - Lots of different experimental conditions; 1 plate per condition (no replicates). These two could be good datasets to burn through while Robert is coming up with his methods.
Stain4, Plate1, Reagent1 and Stain5 - Lots of different conditions; 3 or 4 replicate plates per condition. I guess Stain5 is the best dataset, so it could perhaps be used as a holdout set.

Based on this, I vote for starting with Stain2, @EchteRobert

General data analysis

Creating plate clusters for training and validation splits

In order to split all of the Stain2, Stain3, Stain4, and Stain5 (condition C) plates into clusters that are most similar to each other, I created a hierarchical cluster map based on the PC1 loadings of the mean aggregate profiles of these plates, first including all the outliers and then iteratively removing them until I find a set of clusters that are similar enough to be used for training and validation.
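A minimal sketch of this clustering step, assuming one CSV of mean-aggregated profiles per plate and CellProfiler-style feature names (paths, file layout, and the feature-name regex are assumptions, not the project's actual code):

```python
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA

# Plate names as in the cluster tables below; paths are hypothetical.
plates = ["BR00113818", "BR00112198", "BR00112204"]  # etc.

loadings = {}
for plate in plates:
    profiles = pd.read_csv(f"profiles/{plate}.csv")
    # Assumed CellProfiler-style feature columns.
    features = profiles.filter(regex="^(Cells|Cytoplasm|Nuclei)_")
    pca = PCA(n_components=1).fit(features)
    loadings[plate] = pca.components_[0]  # PC1 loadings over features

loadings_df = pd.DataFrame(loadings)      # features x plates
corr = loadings_df.corr()                 # plate-plate Pearson correlation
sns.clustermap(corr, cmap="vlag", vmin=-1, vmax=1)  # hierarchical clustermap
```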

Main takeaways

  • The final 7 clusters were chosen based on the largest clusters that can be seen in the final clustermap iteration.
  • Clusters 5, 6, and 7 are the highest-quality clusters, i.e. they have the highest correlation between the PC1 loadings of their plates.
  • We already know that the model beats the baseline on cluster 1 for most plates, except on BR00113818 and BR00112199. However, note that this is actually one of the most diverse clusters of the 7 I have created here.
Clustermap all plates

[Figure: ClusterMapAllStainPlates]

plate cluster 1: 3 plates
plate cluster 2: 67 plates
plate cluster 3: 9 plates
plate cluster 4: 1 plate

Clustermap remove iteration 3

[Figure: ClusterMapRemove3]

plate cluster 1: 8 plates
plate cluster 2: 8 plates
plate cluster 3: 41 plates
plate cluster 4: 2 plates
plate cluster 5: 9 plates
plate cluster 6: 2 plates
plate cluster 7: 3 plates

Clustermap final iteration

Cluster numbering goes from top to bottom (where the bottom right cluster is number 7)
[Figure: ClusterMapFinal_dataset]

Clusters final iteration
| index | plate | cluster |
| --- | --- | --- |
| 0 | BR00113818 | 1 |
| 1 | BR00112198 | 1 |
| 2 | BR00112204 | 1 |
| 3 | BR00112199 | 1 |
| 4 | BR00112201 | 1 |
| 5 | BR00112197repeat | 1 |
| 6 | BR00112202 | 1 |
| 7 | BR00112197binned | 1 |
| 8 | BR00112197standard | 1 |
| 25 | BR00116621highexp | 2 |
| 29 | BR00116624bin1 | 2 |
| 30 | BR00116624highexp | 2 |
| 31 | BR00116621bin1 | 2 |
| 35 | BR00116620bin1 | 2 |
| 40 | BR00116620highexp | 2 |
| 23 | BR00116632highexp | 3 |
| 24 | BR00116622 | 3 |
| 27 | 015124-V | 3 |
| 33 | BR00116622highexp | 3 |
| 34 | BR00116633highexp | 3 |
| 36 | BR00116634highexp | 3 |
| 39 | 015124-Vhighexp | 3 |
| 9 | BR00115129 | 4 |
| 10 | BR00115128 | 4 |
| 11 | BR00115133highexp | 4 |
| 12 | BR00115133 | 4 |
| 13 | BR00115127 | 4 |
| 14 | BR00115131 | 4 |
| 15 | BR00115125 | 4 |
| 16 | BR00115134 | 4 |
| 17 | BR00115125highexp | 4 |
| 18 | BR00115128highexp | 4 |
| 19 | BR00116630 | 5 |
| 20 | BR00116625 | 5 |
| 21 | BR00116631 | 5 |
| 22 | BR00116627 | 5 |
| 26 | BR00116630highexp | 5 |
| 28 | BR00116629highexp | 5 |
| 32 | BR00116627highexp | 5 |
| 37 | BR00116628highexp | 5 |
| 38 | BR00116625highexp | 5 |
| 41 | BR00116631highexp | 5 |
| 42 | BR00116628 | 5 |
| 43 | BR00116629 | 5 |
| 48 | BR00120277 | 6 |
| 50 | BR00120276 | 6 |
| 51 | BR00120274 | 6 |
| 52 | BR00120275 | 6 |
| 53 | BR00120271 | 6 |
| 54 | BR00120270 | 6 |
| 56 | BR00120272 | 6 |
| 57 | BR00120273 | 6 |
| 44 | BR00120272confocal | 7 |
| 45 | BR00120277confocal | 7 |
| 46 | BR00120274confocal | 7 |
| 47 | BR00120271confocal | 7 |
| 49 | BR00120276confocal | 7 |
| 55 | BR00120273confocal | 7 |
| 58 | BR00120270confocal | 7 |
| 59 | BR00120275confocal | 7 |

03. Model for Stain2

It is now clear that this feature aggregation model will only serve a certain feature set (meaning a certain dataset line); it is not designed to aggregate arbitrary feature sets, only to be invariant to the number of cells per well. I will start by creating a model that beats the 'mean aggregation' baselines of the Stain2 batches, then move on to Stain3 and Stain4, and finally use Stain5 as the final test set.
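The model architecture itself is not reproduced in this issue. As a rough illustration of what "invariant to the number of cells per well" means, here is a minimal DeepSets-style sketch (the dimensions follow hyperparameters mentioned elsewhere in this project, but this is not the project's actual model):

```python
# Minimal sketch of a permutation-invariant well aggregator (DeepSets-style).
# Mean-pooling over cells makes the output independent of the number
# (and order) of cells per well.
import torch
import torch.nn as nn

class SetAggregator(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 2048, output_dim: int = 2048):
        super().__init__()
        self.encoder = nn.Sequential(            # per-cell embedding
            nn.Linear(n_features, latent_dim), nn.ReLU(),
        )
        self.head = nn.Linear(latent_dim, output_dim)  # well-level profile

    def forward(self, cells: torch.Tensor) -> torch.Tensor:
        # cells: (n_cells, n_features); n_cells may differ per well
        pooled = self.encoder(cells).mean(dim=0)       # invariant to n_cells
        return self.head(pooled)

well = torch.randn(1500, 1324)                  # 1500 cells sampled from one well
profile = SetAggregator(n_features=1324)(well)  # (2048,) well profile
```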

Because of that, it would be ideal if all features were the same across the Stain datasets. This is (somewhat) the case across Stain2, Stain3, and Stain4. However, Stain5 used a slightly different CellProfiler pipeline, resulting in a different and larger feature set. During preprocessing I found that the pipeline from raw single-cell features to data that can be fed directly to the model is quite slow, especially when all features are used (4295 for Stain2-4 and 5794 for Stain5). Model inference and training also become increasingly slower as the number of features increases. From the initial experiments on CPJUMP1 we saw that not all features are needed to create a better profile than the baseline (#1). This is why I have chosen to select only the features common to Stain2-5. This has the advantage of speed, in both preprocessing and inference, and of compatibility, as no separate model will have to be trained to use Stain5 as the test set.

Assuming that the features across Stain2, Stain3, Stain4, and Stain5 are consistent within each experiment, there are 1324 features which are measured in all of them. The features are well distributed across categories: Cells: 441 features, Cytoplasm: 433 features, and Nuclei: 450 features. 1124 of them are decently uncorrelated (absolute Pearson correlation < 0.5) [one plate tested]. From here on, these are the features that will be used to train the model.
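A minimal sketch of this feature selection, assuming the per-plate single-cell features are loaded as pandas DataFrames (function names are illustrative):

```python
# Intersect feature names across the Stain batches, then drop one feature
# from every pair whose absolute Pearson correlation exceeds 0.5.
import numpy as np
import pandas as pd

def common_features(frames: list[pd.DataFrame]) -> list[str]:
    cols = set(frames[0].columns)
    for df in frames[1:]:
        cols &= set(df.columns)
    return sorted(cols)

def drop_correlated(df: pd.DataFrame, threshold: float = 0.5) -> list[str]:
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return [c for c in df.columns if c not in to_drop]
```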

General model experiments

This issue is used to test more general aspects of model development not directly related to, but likely still influenced by, the dataset or model hyperparameters that are used at that time.

P2 02. Final model for the LINCS dataset (batch 1)

Here I trained a model on all data available from batch 1 of the LINCS dataset, which can be listed with: `aws s3 ls s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/`
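For reference, the same prefix can be listed from Python, assuming anonymous access to the public cellpainting-gallery bucket (equivalent to the CLI command above):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous client for the public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(
    Bucket="cellpainting-gallery",
    Prefix="cpg0004-lincs/broad/workspace/backend/2016_04_01_a549_48hr_batch1/",
    Delimiter="/",
)
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])  # one entry per plate folder
```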

The model uses 1745 features because of an issue with 10 plates (broadinstitute/lincs-cell-painting#88 (comment)). In total, I trained the model on 136 plates and 5965 wells, including 1228 unique compounds at the 10 µM dose point. During preprocessing I removed 1587 wells due to missing MoA or compound name (pert_iname) annotation. I used the following hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| batch size | 36 |
| epochs | 100 |
| kFilters | 0.5 |
| latent dim | 2048 |
| learning rate | 0.0005 |
| nr cells | (1500, 800) |
| nr sets | 8 |
| optimizer | AdamW |
| output dim | 2048 |
| true batch size | 288 |
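Note that the true batch size is consistent with batch size × nr sets (36 × 8 = 288), suggesting each well in a batch is expanded into several randomly sampled cell sets. A purely hypothetical sketch of such sampling (names and shapes are assumptions, not the project's actual code):

```python
import torch

def sample_sets(cells: torch.Tensor, nr_sets: int = 8, n_cells: int = 1500) -> torch.Tensor:
    # Draw `nr_sets` random subsets of `n_cells` cells (with replacement),
    # so one well contributes `nr_sets` entries to the effective batch:
    # batch_size * nr_sets = 36 * 8 = 288.
    idx = torch.stack([
        torch.randint(len(cells), (n_cells,)) for _ in range(nr_sets)
    ])
    return cells[idx]  # (nr_sets, n_cells, n_features)
```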

I assess the model on the 10 µM dose point using replicate and MoA prediction, and similarly on the 3.33 µM dose point, which is considered the test set.
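The exact evaluation code is not reproduced here; below is a minimal sketch of what replicate-prediction mAP could look like, assuming cosine-similarity ranking with a compound's replicate wells as the positives (an assumption about the metric, not the project's implementation):

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.metrics.pairwise import cosine_similarity

def replicate_map(profiles: np.ndarray, compound_ids: np.ndarray) -> float:
    # Rank all other wells by similarity to each query well and compute
    # average precision with that compound's replicates as positives.
    sim = cosine_similarity(profiles)
    aps = []
    for i in range(len(profiles)):
        mask = np.arange(len(profiles)) != i            # exclude the query
        labels = (compound_ids[mask] == compound_ids[i]).astype(int)
        if labels.sum() == 0:
            continue                                    # no replicates to find
        aps.append(average_precision_score(labels, sim[i, mask]))
    return float(np.mean(aps))
```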

Results

  • The model significantly improves upon the average baseline for replicate and MoA prediction for both the 10 and 3.33 uM dose points.
  • It improves the MoA-prediction mAP by 60% and 30% for the 10 µM dose point (training) and 3.33 µM dose point (test) data, respectively (0.0541 vs. 0.0338 and 0.042 vs. 0.0322).
  • I could have trained the model a bit longer, e.g. 150 epochs, as the validation mAP and loss had not yet fully converged.
Results 10 uM dose point

Replicate prediction
Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=84.81208433212997, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.7473 | 0.269 | 0 |

MoA prediction

Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=6.753694914168434, pvalue=1.5518902810751288e-11)

| plate | mAP model | mAP BM | mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.0541 | 0.0338 | 0.0002 |
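The Ttest_indResult lines above are SciPy output; Welch's variant (unequal variances) is `ttest_ind` with `equal_var=False`. A sketch with placeholder arrays standing in for the per-sample mAP values:

```python
import numpy as np
from scipy import stats

mlp_map_values = np.random.rand(500)  # placeholder: per-sample mAP, model
bm_map_values = np.random.rand(500)   # placeholder: per-sample mAP, baseline
result = stats.ttest_ind(mlp_map_values, bm_map_values, equal_var=False)
print(result)  # Ttest_indResult(statistic=..., pvalue=...)
```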
Results 3.33 uM dose point

Replicate prediction
Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=49.02599189522616, pvalue=0.0)

| plate | Training mAP model | Training mAP BM | Training mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.4465 | 0.1695 | 0 |

MoA prediction
Welch's t-test between mlp mAP and bm mAP: Ttest_indResult(statistic=3.525483296865904, pvalue=0.0004250301209859708)

| plate | mAP model | mAP BM | mAP shuffled |
| --- | --- | --- | --- |
| all plates | 0.042 | 0.0322 | 0 |
Loss curves

[Figures: training and validation loss curves (screenshots, 2022-10-05)]
All plate names SQ00014812_SQ00014813_SQ00014814_SQ00014815_SQ00014816_SQ00014817_SQ00014818_SQ00014819_SQ00014820_SQ00015041_SQ00015042_SQ00015043_SQ00015044_SQ00015045_SQ00015046_SQ00015047_SQ00015048_SQ00015049_SQ00015050_SQ00015051_SQ00015052_SQ00015053_SQ00015054_SQ00015055_SQ00015056_SQ00015057_SQ00015058_SQ00015059_SQ00015096_SQ00015097_SQ00015098_SQ00015099_SQ00015100_SQ00015101_SQ00015102_SQ00015103_SQ00015105_SQ00015106_SQ00015107_SQ00015108_SQ00015109_SQ00015110_SQ00015111_SQ00015112_SQ00015116_SQ00015117_SQ00015118_SQ00015119_SQ00015120_SQ00015121_SQ00015122_SQ00015123_SQ00015124_SQ00015125_SQ00015126_SQ00015127_SQ00015128_SQ00015129_SQ00015130_SQ00015131_SQ00015132_SQ00015133_SQ00015134_SQ00015135_SQ00015136_SQ00015137_SQ00015138_SQ00015139_SQ00015140_SQ00015141_SQ00015142_SQ00015143_SQ00015144_SQ00015145_SQ00015146_SQ00015147_SQ00015148_SQ00015149_SQ00015150_SQ00015151_SQ00015152_SQ00015153_SQ00015154_SQ00015155_SQ00015156_SQ00015157_SQ00015158_SQ00015159_SQ00015160_SQ00015162_SQ00015163_SQ00015164_SQ00015165_SQ00015166_SQ00015167_SQ00015168_SQ00015169_SQ00015170_SQ00015171_SQ00015172_SQ00015173_SQ00015194_SQ00015195_SQ00015196_SQ00015197_SQ00015198_SQ00015199_SQ00015200_SQ00015201_SQ00015202_SQ00015203_SQ00015204_SQ00015205_SQ00015206_SQ00015207_SQ00015208_SQ00015209_SQ00015210_SQ00015211_SQ00015212_SQ00015214_SQ00015215_SQ00015216_SQ00015217_SQ00015218_SQ00015219_SQ00015220_SQ00015221_SQ00015222_SQ00015223_SQ00015224_SQ00015229_SQ00015230_SQ00015231_SQ00015232_SQ00015233

07. Model for Stain5

In this issue I will post all results on Stain5, with various models and evaluation metrics. I first trained a model on 5 plates each from Stain2, Stain3, and Stain4, for a total of 15 training plates. I then used this model to run inference directly on Stain5. After that, I fine-tuned the model (transfer learning) using 3 plates from Stain5 and 10, 20, 40, and 80% of the available compounds for training/fine-tuning. I also tested this with 1 plate and the same fractions, but multiple plates turned out to be required to generalize to the feature patterns of Stain5.
I fine-tune the model by training for 100 epochs and then taking the model with the best validation mAP. There are perhaps better ways to do this, but this is a first proof-of-concept experiment.
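A hedged sketch of this fine-tuning protocol (the helper functions, data loaders, and checkpoint path are placeholders for project code, not its actual implementation; the learning rate follows the hyperparameter table above):

```python
import copy
import torch

# Placeholders: `model`, `finetune_loader`, `validation_loader`,
# `train_one_epoch`, and `evaluate_map` are assumed to exist.
model.load_state_dict(torch.load("stain234_pretrained.pt"))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

best_map, best_state = 0.0, None
for epoch in range(100):
    train_one_epoch(model, optimizer, finetune_loader)
    val_map = evaluate_map(model, validation_loader)
    if val_map > best_map:                       # keep the best-validation-mAP model
        best_map = val_map
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)                # final fine-tuned model
```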

Main takeaways

  • As expected, the model does not directly translate to Stain5. This is not due to plate/batch effects, as those are much smaller. Instead, it is due to different experimental conditions, which cause a shift in the feature distributions so that the model is no longer able to correctly aggregate the single-cell data. See #7 (comment) for the hierarchical cluster map showing that Stain2, Stain3, and Stain4 fall in a different cluster than Stain5 altogether.
  • Secondly, fine-tuning the model does increase performance on plates that are similar to the fine-tuning data (i.e. fine-tuning on confocal plates increases performance on confocal plates and fine-tuning on widefield plates increases performance on widefield plates, but not both at the same time). However, in order to generalize to unseen compounds I need to use more than 3 plates. This is the same issue I ran into before when training the models: 1 training plate does not generalize to unseen plates and 3 training plates do not generalize to unseen compounds. So I probably need 5 or more plates to generalize to this type of data.

Results

Out of distribution model
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM | Batch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BR00120530 | 0.26 | 0.28 | 0.23 | 0.39 | 58.9 | 58.9 | CondA PE |
| BR00120530confocal | 0.03 | 0.29 | 0.03 | 0.4 | 5.6 | 56.7 | CondA PE |
| BR00120526confocal | 0.03 | 0.29 | 0.02 | 0.36 | 3.3 | 58.9 | CondA Thermo |
| BR00120526 | 0.35 | 0.28 | 0.38 | 0.37 | 72.2 | 56.7 | CondA Thermo |
| BR00120536confocal | 0.06 | 0.25 | 0.05 | 0.37 | 17.8 | 55.6 | CondB PE |
| BR00120536 | 0.02 | 0.25 | 0.03 | 0.35 | 4.4 | 54.4 | CondB PE |
| BR00120532confocal | 0.05 | 0.24 | 0.06 | 0.35 | 8.9 | 50 | CondB Thermo |
| BR00120532 | 0.15 | 0.23 | 0.18 | 0.38 | 36.7 | 50 | CondB Thermo |
| BR00120274 | 0.17 | 0.23 | 0.21 | 0.34 | 31.1 | 54.4 | CondC PE |
| BR00120274confocal | 0.03 | 0.21 | 0.03 | 0.36 | 4.4 | 52.2 | CondC PE |
| BR00120270 | 0.25 | 0.26 | 0.29 | 0.38 | 55.6 | 48.9 | CondC Thermo |
| BR00120270confocal | 0.02 | 0.26 | 0.03 | 0.39 | 2.2 | 52.2 | CondC Thermo |
Fine-tuned model 10%
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM | Batch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BR00120530 | 0.22 | 0.28 | 0.2 | 0.39 | 43.3 | 58.9 | CondA PE |
| BR00120530confocal | 0.03 | 0.29 | 0.03 | 0.4 | 5.6 | 56.7 | CondA PE |
| BR00120526confocal | 0.03 | 0.29 | 0.02 | 0.36 | 3.3 | 58.9 | CondA Thermo |
| BR00120526 | 0.3 | 0.28 | 0.36 | 0.37 | 53.3 | 56.7 | CondA Thermo |
| BR00120536confocal | 0.06 | 0.25 | 0.05 | 0.37 | 10 | 55.6 | CondB PE |
| BR00120536 | 0.03 | 0.25 | 0.04 | 0.35 | 5.6 | 54.4 | CondB PE |
| BR00120532confocal | 0.05 | 0.24 | 0.07 | 0.35 | 11.1 | 50 | CondB Thermo |
| BR00120532 | 0.13 | 0.23 | 0.18 | 0.38 | 27.8 | 50 | CondB Thermo |
| BR00120274 | 0.16 | 0.23 | 0.2 | 0.34 | 27.8 | 54.4 | CondC PE |
| BR00120274confocal | 0.03 | 0.21 | 0.04 | 0.36 | 5.6 | 52.2 | CondC PE |
| BR00120270 | 0.23 | 0.26 | 0.28 | 0.38 | 42.2 | 48.9 | CondC Thermo |
| BR00120270confocal | 0.03 | 0.26 | 0.03 | 0.39 | 4.4 | 52.2 | CondC Thermo |
Fine-tuned model 20%
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM | Batch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BR00120530 | 0.22 | 0.28 | 0.23 | 0.39 | 57.8 | 58.9 | CondA PE |
| BR00120530confocal | 0.03 | 0.29 | 0.02 | 0.4 | 5.6 | 56.7 | CondA PE |
| BR00120526confocal | 0.02 | 0.29 | 0.02 | 0.36 | 2.2 | 58.9 | CondA Thermo |
| BR00120526 | 0.32 | 0.28 | 0.31 | 0.37 | 75.6 | 56.7 | CondA Thermo |
| BR00120536confocal | 0.05 | 0.25 | 0.05 | 0.37 | 17.8 | 55.6 | CondB PE |
| BR00120536 | 0.09 | 0.25 | 0.03 | 0.35 | 23.3 | 54.4 | CondB PE |
| BR00120532confocal | 0.05 | 0.24 | 0.06 | 0.35 | 10 | 50 | CondB Thermo |
| BR00120532 | 0.27 | 0.23 | 0.21 | 0.38 | 61.1 | 50 | CondB Thermo |
| BR00120274 | 0.22 | 0.23 | 0.21 | 0.34 | 53.3 | 54.4 | CondC PE |
| BR00120274confocal | 0.03 | 0.21 | 0.04 | 0.36 | 8.9 | 52.2 | CondC PE |
| BR00120270 | 0.33 | 0.26 | 0.34 | 0.38 | 75.6 | 48.9 | CondC Thermo |
| BR00120270confocal | 0.03 | 0.26 | 0.04 | 0.39 | 3.3 | 52.2 | CondC Thermo |
Fine-tuned model 40%
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM | Batch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Fine-tune plates** | | | | | | | |
| BR00120526 | 0.35 | 0.28 | 0.34 | 0.37 | 78.9 | 56.7 | CondA Thermo |
| BR00120532 | 0.27 | 0.23 | 0.21 | 0.38 | 57.8 | 50 | CondB Thermo |
| BR00120270 | 0.36 | 0.26 | 0.33 | 0.38 | 78.9 | 48.9 | CondC Thermo |
| **Hold-out plates** | | | | | | | |
| BR00120530 | 0.26 | 0.28 | 0.22 | 0.39 | 65.6 | 58.9 | CondA PE |
| BR00120530confocal | 0.03 | 0.29 | 0.02 | 0.4 | 3.3 | 56.7 | CondA PE |
| BR00120526confocal | 0.02 | 0.29 | 0.03 | 0.36 | 3.3 | 58.9 | CondA Thermo |
| BR00120536confocal | 0.05 | 0.25 | 0.04 | 0.37 | 17.8 | 55.6 | CondB PE |
| BR00120536 | 0.06 | 0.25 | 0.04 | 0.35 | 14.4 | 54.4 | CondB PE |
| BR00120532confocal | 0.05 | 0.24 | 0.06 | 0.35 | 15.6 | 50 | CondB Thermo |
| BR00120274 | 0.25 | 0.23 | 0.22 | 0.34 | 54.4 | 54.4 | CondC PE |
| BR00120274confocal | 0.03 | 0.21 | 0.03 | 0.36 | 4.4 | 52.2 | CondC PE |
| BR00120270confocal | 0.03 | 0.26 | 0.03 | 0.39 | 6.7 | 52.2 | CondC Thermo |
Fine-tuned model 80%
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM | Batch |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Fine-tune plates** | | | | | | | |
| BR00120526 | 0.37 | 0.28 | 0.39 | 0.37 | 85.6 | 56.7 | CondA Thermo |
| BR00120532 | 0.3 | 0.23 | 0.26 | 0.38 | 78.9 | 50 | CondB Thermo |
| BR00120270 | 0.39 | 0.26 | 0.36 | 0.38 | 87.8 | 48.9 | CondC Thermo |
| **Hold-out plates** | | | | | | | |
| BR00120530 | 0.25 | 0.28 | 0.23 | 0.39 | 63.3 | 58.9 | CondA PE |
| BR00120530confocal | 0.03 | 0.29 | 0.03 | 0.4 | 8.9 | 56.7 | CondA PE |
| BR00120526confocal | 0.03 | 0.29 | 0.03 | 0.36 | 2.2 | 58.9 | CondA Thermo |
| BR00120536confocal | 0.06 | 0.25 | 0.06 | 0.37 | 23.3 | 55.6 | CondB PE |
| BR00120536 | 0.05 | 0.25 | 0.03 | 0.35 | 26.7 | 54.4 | CondB PE |
| BR00120532confocal | 0.05 | 0.24 | 0.08 | 0.35 | 17.8 | 50 | CondB Thermo |
| BR00120274 | 0.26 | 0.23 | 0.26 | 0.34 | 60 | 54.4 | CondC PE |
| BR00120274confocal | 0.03 | 0.21 | 0.04 | 0.36 | 3.3 | 52.2 | CondC PE |
| BR00120270confocal | 0.03 | 0.26 | 0.03 | 0.39 | 8.9 | 52.2 | CondC Thermo |

XX. Possible future experiments

In this issue, I will outline experiments that may be useful in the future to further investigate the inner workings of the model, but for which I currently do not have time (or which have low priority).

06. Final model iterations

Two cluster training data (T: S3+S4)

Some final tweaks to training the model will be made in this issue. All of these tweaks will be made with Stain2, Stain3, and Stain4 in mind at the same time, instead of one at a time. The first model is trained on 3 plates each from Stain3 and Stain4 at the same time and evaluated on Stain2, Stain3, and Stain4.

Main takeaways

  • It's possible to generalize to clusters outside of the trained clusters by using training data from at least two clusters at the same time.
  • This actually also improves overall validation mAP for plates within the training clusters, although it is slightly worse than the best model trained on the Stain2 cluster specifically.
Table Stain4
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
| --- | --- | --- | --- | --- | --- | --- |
| **Training plates** | | | | | | |
| BR00116625highexp | 0.74 | 0.32 | 0.36 | 0.28 | 98.9 | 61.1 |
| BR00116628highexp | 0.73 | 0.32 | 0.32 | 0.31 | 98.9 | 57.8 |
| BR00116629highexp | 0.78 | 0.29 | 0.35 | 0.29 | 100 | 52.2 |
| **Validation plates** | | | | | | |
| BR00116631highexp | 0.47 | 0.28 | 0.27 | 0.3 | 93.3 | 53.3 |
| BR00116625 | 0.6 | 0.31 | 0.35 | 0.29 | 98.9 | 58.9 |
| BR00116630highexp | 0.52 | 0.29 | 0.3 | 0.3 | 97.8 | 58.9 |
| BR00116631 | 0.5 | 0.3 | 0.26 | 0.28 | 94.4 | 57.8 |
| BR00116627highexp | 0.55 | 0.31 | 0.38 | 0.27 | 98.9 | 56.7 |
| BR00116627 | 0.55 | 0.3 | 0.36 | 0.29 | 96.7 | 56.7 |
| BR00116629 | 0.61 | 0.3 | 0.32 | 0.29 | 98.9 | 52.2 |
| BR00116628 | 0.58 | 0.32 | 0.28 | 0.29 | 98.9 | 58.9 |
Table Stain3
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
| --- | --- | --- | --- | --- | --- | --- |
| **Training plates** | | | | | | |
| BR00115134 | 0.75 | 0.37 | 0.42 | 0.33 | 98.9 | 58.9 |
| BR00115125 | 0.75 | 0.36 | 0.44 | 0.29 | 98.9 | 54.4 |
| BR00115133highexp | 0.76 | 0.38 | 0.38 | 0.31 | 97.8 | 60 |
| **Validation plates** | | | | | | |
| BR00115128highexp | 0.52 | 0.4 | 0.42 | 0.33 | 97.8 | 58.9 |
| BR00115125highexp | 0.58 | 0.37 | 0.41 | 0.31 | 98.9 | 55.6 |
| BR00115131 | 0.54 | 0.38 | 0.44 | 0.29 | 98.9 | 58.9 |
| BR00115126 | 0.34 | 0.32 | 0.33 | 0.28 | 57.8 | 53.3 |
| BR00115133 | 0.58 | 0.38 | 0.4 | 0.3 | 96.7 | 62.2 |
| BR00115127 | 0.56 | 0.38 | 0.47 | 0.31 | 98.9 | 58.9 |
| BR00115128 | 0.53 | 0.39 | 0.42 | 0.32 | 96.7 | 61.1 |
| BR00115129 | 0.57 | 0.38 | 0.45 | 0.32 | 98.9 | 52.2 |
Table Stain2
| plate | Training mAP model | Training mAP BM | Validation mAP model | Validation mAP BM | PR model | PR BM |
| --- | --- | --- | --- | --- | --- | --- |
| BR00112202 | 0.43 | 0.34 | 0.38 | 0.3 | 88.9 | 54.4 |
| BR00112197standard | 0.45 | 0.4 | 0.41 | 0.28 | 85.6 | 56.7 |
| BR00112198 | 0.43 | 0.35 | 0.4 | 0.3 | 91.1 | 56.7 |
| BR00112197repeat | 0.43 | 0.41 | 0.37 | 0.31 | 81.1 | 63.3 |
| BR00112204 | 0.4 | 0.35 | 0.46 | 0.29 | 82.2 | 58.9 |
| BR00112197binned | 0.43 | 0.41 | 0.39 | 0.3 | 86.7 | 58.9 |
| BR00112201 | 0.47 | 0.4 | 0.41 | 0.32 | 91.1 | 66.7 |

04. Model for Stain3

To test the generalization of the model trained on Stain2, I will now evaluate it on Stain3. Based on the results, further advancements will be made by training on plates of Stain3 (and possibly then evaluating on Stain2 in turn).

Stain3 consists of 17 plates, divided into 4 "batches" defined by the analysis pipeline used:

  • Standard
  • Multiplane
  • HighExp
  • Bin1

To analyze the relations between the different plates in Stain3, I calculated the correlation between the PC1 loadings of the mean aggregated profiles of every plate. The BR00115130 plate stands out. This agrees with the findings in https://github.com/jump-cellpainting/pilot-analysis/issues/20, where this plate achieved the lowest PR, likely due to many wells having evaporated or some other problem. Another plate that stands out is BR00115132, which contains a 4-fold dilution of all dyes. BR00115130 will be left out of the analysis altogether; BR00115132 will be kept, although it will not be used as a training plate.

[Figure: PC1loadings_meanProfiles]

Number of cells per well per plate

[Figure: NRCELLS_dist]

05. Model for Stain4

To test the generalization of the model trained on Stain3 (and tested on Stain2), I will now evaluate it on Stain4. Based on the results, further advancements will be made by training on plates of Stain4 (and then evaluating on Stain2 and Stain3 in turn).

Stain4 consists of 30 plates, divided into 5 batches, each with different staining conditions:

  • Baseline staining conditions used in Stain3 (Stain2)
  • Baseline staining conditions used in Stain3 with 2-fold dilution of all dyes (Stain2_2)
  • Baseline staining conditions used in Stain3 with 2-fold dilution of Phalloidin (Stain2_Phalloidin_2)
  • Baseline staining conditions used in Stain3 with 2-fold dilution of Phalloidin and ConA (Stain2_Phalloidin_ConA_2)
  • Bray et al. staining conditions (Bray)

Apart from that, standard exposure vs. high exposure and Binning 1 vs. Binning 2 comparisons were also made.

To analyze the relations between the different plates in Stain4, I calculated the correlation between the PC1 loadings of the mean aggregated profiles of every plate. I only included the plates that were similar enough to form a large cluster.

Clusters: [Figure: PlateClustermap]

Cells per well per plate (only the plates I have downloaded so far, but these still give a good indication of the dataset):
[Figure: NRCELLS_Stain4]

P2 01. Testing the method on subsets of the LINCS dataset

LINCS contains 6 dose points: 0.04 µM, 0.12 µM, 0.37 µM, 1.11 µM, 3.33 µM, and 10 µM. For my experiments, I will use the highest dose (10 µM) as the training set and the validation set. The model is trained to create profiles that attract replicate compound profiles and repel non-replicate compound profiles. It is then validated by evaluating the ability of these profiles to predict MoAs (or find sister compounds). Finally, the model will be tested on the 3.33 µM dose point data as a hold-out set. This data should look significantly different from the training and validation data.

I will follow the same data exclusion protocol as Michael did in his research:

“The evaluation metrics all require the number of compounds in an MOA to be at least two (MOA class size of at least two). A compound that does not have other compounds annotated with the same MOA cannot have a precision value and does not affect Enrichment and Hit@k. Additionally, there is no use for compounds with unknown MOA labels, because I can not apply any metrics. Finally, I used only a single dose per compound to make experimental iterations practical. I picked the 10uM dose point because we found that this dose produces the strongest phenotypes across compounds without reducing the ability to group MOAs [42]. These three requirements lead to a subset of the LINCS data in which I 1) only keep the highest dose of each perturbation, 2) delete all compounds that have unknown MOA labels, and 3) drop all compounds with MOA sizes smaller than two. All further processing and experiment are based on this subset of data. This subset process is documented in my repository.”

Which should result in similar data numbers:

“The resulting (sub-selected) dataset comprises 8,818 wells spanning over 136 plates, five different batches, and 18.2 million cells. Each of the 1,144 perturbations (compounds) have five technical replicates - each on a separate plate but in the same position on the plate (same well position). Four of these technical replicate wells were produced in the same batch. This is highly relevant for training and technical artifacts, because I expect cells in the same well and same batch to have similar technical artifacts, but the corresponding wells from a different batch may look very different from one another. Each of the 136 plates has 24 negative controls (DMSO) distributed across the plate (3264 DMSOs in total). The 1,144 compounds can be categorized into 235 MOAs, and every MOA has at least two compounds (MOA class size > 1). Some MOAs, such as phosphodiesterase inhibitor, have an MOA class size of 35, while most other MOAs are only found for two compounds.”
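A minimal sketch of the three exclusion rules from the quoted protocol, assuming hypothetical metadata column names (Metadata_dose, Metadata_moa, Metadata_pert_iname); the actual column names may differ:

```python
import pandas as pd

def subset_lincs(meta: pd.DataFrame) -> pd.DataFrame:
    meta = meta[meta["Metadata_dose"] == 10.0]            # 1) highest dose only
    meta = meta.dropna(subset=["Metadata_moa"])           # 2) known MoA only
    # 3) keep only MoAs annotated for at least two compounds
    moa_sizes = meta.groupby("Metadata_moa")["Metadata_pert_iname"].nunique()
    valid = moa_sizes[moa_sizes >= 2].index
    return meta[meta["Metadata_moa"].isin(valid)]
```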

I have put some relevant quotes from Michael's work and the LINCS manuscript here: https://docs.google.com/document/d/1z2U5o91vzBwB-4xtryYn5d3kSWJi8ZSerE_MAzdLT_0/edit#heading=h.sbi6l2r6p5ec

02. Model for CPJUMP1 + Stain2

The first results here were generated by the last model presented (BS64) in #1. Note that the Stain2 confocal data contains 1433 features instead of the 1938 in CPJUMP1, so some features are randomly selected and repeated to make the data fit the model's expected input size.
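A minimal sketch of this select-and-repeat padding (the function, seed, and shapes are illustrative, not the project's actual code):

```python
import numpy as np

def pad_features(x: np.ndarray, target: int = 1938, seed: int = 0) -> np.ndarray:
    # Randomly pick existing feature columns and repeat them until the
    # feature dimension matches what the model expects.
    rng = np.random.default_rng(seed)
    extra = rng.choice(x.shape[1], size=target - x.shape[1], replace=True)
    return np.concatenate([x, x[:, extra]], axis=1)

cells = np.random.rand(1500, 1433)  # placeholder Stain2 single-cell features
padded = pad_features(cells)        # (1500, 1938), matches the CPJUMP1 model
```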

Main takeaways

As expected, the model does not directly translate to the Stain2 dataset. I expect the cause to be two-fold:

  1. Stain2 measured fewer, and perhaps even different, features; the model has likely not learned to interpret general feature distributions yet, but has instead overfit to the features present in CPJUMP1.
  2. Stain2 contains different compounds, which produce different feature distributions than the compounds found in CPJUMP1.

Next steps: find a way for the model to generalize to different features and compounds.

Model trained on CPJUMP1 compound data on Stain2_Batch2_Binned PR // BENCHMARK: 0.6 PS
[Figure: MLP_CPJUMP1_BS64_PR]

Model trained on CPJUMP1 compound data on Stain2_Batch2_Confocal PR // BENCHMARK: 0.533
[Figure: MLP_CPJUMP1_BS64_Confocal_PR]
