- Step 1: Download cell images
- Step 2: Extract features using CellProfiler and DeepProfiler
- Step 3: Process the profiles using pycytominer
- Step 4: Run the analysis script
- Data Organization
- Maintenance plan
- Compute resources
- License
Features from the cell images were extracted using CellProfiler, and the single-cell profiles were aggregated, annotated, normalized and feature selected using pycytominer. Features were also extracted using DeepProfiler and then annotated and spherized. The resulting profiles were analyzed using the notebooks in this repo. Steps for reproducing the data in this repository are outlined below.
Cell images are available in an S3 bucket. The images can be downloaded using the command
aws s3 cp \
--recursive \
s3://cellpainting-gallery/jump-pilot/source_4/images/ .
You can test the download with a single file using:
suffix=images/2020_11_04_CPJUMP1/images/BR00117010__2020-11-08T18_18_00-Measurement1/Images/r01c01f01p01-ch1sk1fk1fl1.tiff
aws s3 cp \
s3://cellpainting-gallery/jump-pilot/source_4/${suffix} \
.
See this wiki for sample Cell Painting images and the meaning of (CellProfiler-derived) Cell Painting features.
There are six batches of data - 2020_11_04_CPJUMP1, 2020_11_18_CPJUMP1_TimepointDay1, 2020_11_19_TimepointDay4, 2020_12_02_CPJUMP1_2WeeksTimePoint, 2020_12_07_CPJUMP1_4WeeksTimePoint and 2020_12_08_CPJUMP1_Bleaching.
Batch 2020_11_04_CPJUMP1 contains 51 plates (384-well) of cells from two cell lines treated with different types of perturbations at different time points.
Barcode | Description | Number_of_images |
---|---|---|
BR00118049 | A549 96-hour ORF w/ Blasticidin Plate 1 | 27648 |
BR00118050 | A549 96-hour ORF Plate 1 | 27648 |
BR00117006 | A549 96-hour ORF Plate 2 | 27648 |
BR00118039 | U2OS 96-hour ORF Plate 1 | 27648 |
BR00118040 | U2OS 96-hour ORF Plate 2 | 27648 |
BR00117020 | A549 48-hour ORF Plate 1 | 27648 |
BR00117021 | A549 48-hour ORF Plate 2 | 27648 |
BR00117022 | U2OS 48-hour ORF Plate 1 | 27648 |
BR00117023 | U2OS 48-hour ORF Plate 2 | 27648 |
BR00118041 | A549 96-hour CRISPR Plate 1 | 27560 |
BR00118042 | A549 96-hour CRISPR Plate 2 | 27648 |
BR00118043 | A549 96-hour CRISPR Plate 3 | 27648 |
BR00118044 | A549 96-hour CRISPR Plate 4 | 27648 |
BR00118045 | U2OS 96-hour CRISPR Plate 1 | 27648 |
BR00118046 | U2OS 96-hour CRISPR Plate 2 | 27640 |
BR00118047 | U2OS 96-hour CRISPR Plate 3 | 27648 |
BR00118048 | U2OS 96-hour CRISPR Plate 4 | 27648 |
BR00117003 | A549 144-hour CRISPR Plate 1 | 27632 |
BR00117004 | A549 144-hour CRISPR Plate 2 | 27648 |
BR00117005 | A549 144-hour CRISPR Plate 3 | 27568 |
BR00117000 | A549 144-hour CRISPR Plate 4 | 27640 |
BR00117002 | A549 144-hour CRISPR w/ Puromycin Plate 1 | 27648 |
BR00117001 | A549 144-hour CRISPR w/ Puromycin Plate 2 | 27648 |
BR00116997 | U2OS 144-hour CRISPR Plate 1 | 27648 |
BR00116998 | U2OS 144-hour CRISPR Plate 2 | 27648 |
BR00116999 | U2OS 144-hour CRISPR Plate 3 | 27648 |
BR00116996 | U2OS 144-hour CRISPR Plate 4 | 27648 |
BR00116991 | A549 24-hour Compound Plate 1 | 27648 |
BR00116992 | A549 24-hour Compound Plate 2 | 27640 |
BR00116993 | A549 24-hour Compound Plate 3 | 27352 |
BR00116994 | A549 24-hour Compound Plate 4 | 27576 |
BR00116995 | U2OS 24-hour Compound Plate 1 | 27648 |
BR00117024 | U2OS 24-hour Compound Plate 2 | 27648 |
BR00117025 | U2OS 24-hour Compound Plate 3 | 27648 |
BR00117026 | U2OS 24-hour Compound Plate 4 | 27648 |
BR00117017 | A549 48-hour Compound Plate 1 | 49144 |
BR00117019 | A549 48-hour Compound Plate 2 | 49152 |
BR00117015 | A549 48-hour Compound Plate 3 | 49144 |
BR00117016 | A549 48-hour Compound Plate 4 | 49152 |
BR00117012 | U2OS 48-hour Compound Plate 1 | 27648 |
BR00117013 | U2OS 48-hour Compound Plate 2 | 27648 |
BR00117010 | U2OS 48-hour Compound Plate 3 | 27648 |
BR00117011 | U2OS 48-hour Compound Plate 4 | 27648 |
BR00117054 | A549 48-hour +20% Seed Density Compound Plate 1 | 27648 |
BR00117055 | A549 48-hour +20% Seed Density Compound Plate 2 | 27648 |
BR00117008 | A549 48-hour -20% Seed Density Compound Plate 1 | 27648 |
BR00117009 | A549 48-hour -20% Seed Density Compound Plate 2 | 27648 |
BR00117052 | A549 Cas9 48-hour Compound Plate 1 | 27648 |
BR00117053 | A549 Cas9 48-hour Compound Plate 2 | 27648 |
BR00117050 | A549 Cas9 48-hour Compound Plate 3 | 27648 |
BR00117051 | A549 Cas9 48-hour Compound Plate 4 | 27648 |
Batch 2020_11_18_CPJUMP1_TimepointDay1 contains eight ORF plates imaged one day after staining.
Barcode | Description | Number_of_images |
---|---|---|
BR00118050 | A549 96-hour ORF Plate 1 | 12280 |
BR00117006 | A549 96-hour ORF Plate 2 | 12272 |
BR00118039 | U2OS 96-hour ORF Plate 1 | 12288 |
BR00118040 | U2OS 96-hour ORF Plate 2 | 12280 |
BR00117020 | A549 48-hour ORF Plate 1 | 12288 |
BR00117021 | A549 48-hour ORF Plate 2 | 12288 |
BR00117022 | U2OS 48-hour ORF Plate 1 | 12288 |
BR00117023 | U2OS 48-hour ORF Plate 2 | 12288 |
Batch 2020_11_19_TimepointDay4 contains eight ORF plates imaged four days after staining.
Barcode | Description | Number_of_images |
---|---|---|
BR00118050 | A549 96-hour ORF Plate 1 | 12288 |
BR00117006 | A549 96-hour ORF Plate 2 | 12288 |
BR00118039 | U2OS 96-hour ORF Plate 1 | 12288 |
BR00118040 | U2OS 96-hour ORF Plate 2 | 12288 |
BR00117020 | A549 48-hour ORF Plate 1 | 12288 |
BR00117021 | A549 48-hour ORF Plate 2 | 12288 |
BR00117022 | U2OS 48-hour ORF Plate 1 | 12208 |
BR00117023 | U2OS 48-hour ORF Plate 2 | 12280 |
Batch 2020_12_02_CPJUMP1_2WeeksTimePoint contains eight ORF plates imaged 14 days after staining.
Barcode | Description | Number_of_images |
---|---|---|
BR00118050 | A549 96-hour ORF Plate 1 | 27648 |
BR00117006 | A549 96-hour ORF Plate 2 | 27648 |
BR00118039 | U2OS 96-hour ORF Plate 1 | 27648 |
BR00118040 | U2OS 96-hour ORF Plate 2 | 27648 |
BR00117020 | A549 48-hour ORF Plate 1 | 27320 |
BR00117021 | A549 48-hour ORF Plate 2 | 27632 |
BR00117022 | U2OS 48-hour ORF Plate 1 | 27648 |
BR00117023 | U2OS 48-hour ORF Plate 2 | 27648 |
Batch 2020_12_07_CPJUMP1_4WeeksTimePoint contains eight ORF plates imaged 28 days after staining.
Barcode | Description | Number_of_images |
---|---|---|
BR00118050 | A549 96-hour ORF Plate 1 | 27648 |
BR00117006 | A549 96-hour ORF Plate 2 | 27648 |
BR00118039 | U2OS 96-hour ORF Plate 1 | 27648 |
BR00118040 | U2OS 96-hour ORF Plate 2 | 27648 |
BR00117020 | A549 48-hour ORF Plate 1 | 27648 |
BR00117021 | A549 48-hour ORF Plate 2 | 27640 |
BR00117022 | U2OS 48-hour ORF Plate 1 | 27648 |
BR00117023 | U2OS 48-hour ORF Plate 2 | 27648 |
Batch 2020_12_08_CPJUMP1_Bleaching contains four compound plates imaged an additional six times (A, B, C, D, E and F).
Barcode | Description | Number_of_images |
---|---|---|
BR00116991A | A549 24-hour Compound Plate 1 | 21504 |
BR00116992A | A549 24-hour Compound Plate 2 | 21504 |
BR00116993A | A549 24-hour Compound Plate 3 | 27648 |
BR00116994A | A549 24-hour Compound Plate 4 | 27648 |
BR00116991B | A549 24-hour Compound Plate 1 | 21504 |
BR00116992B | A549 24-hour Compound Plate 2 | 21504 |
BR00116993B | A549 24-hour Compound Plate 3 | 27648 |
BR00116994B | A549 24-hour Compound Plate 4 | 27648 |
BR00116991C | A549 24-hour Compound Plate 1 | 21504 |
BR00116992C | A549 24-hour Compound Plate 2 | 21504 |
BR00116993C | A549 24-hour Compound Plate 3 | 27648 |
BR00116994C | A549 24-hour Compound Plate 4 | 27648 |
BR00116991D | A549 24-hour Compound Plate 1 | 21504 |
BR00116992D | A549 24-hour Compound Plate 2 | 21504 |
BR00116993D | A549 24-hour Compound Plate 3 | 27648 |
BR00116994D | A549 24-hour Compound Plate 4 | 27648 |
BR00116991E | A549 24-hour Compound Plate 1 | 21504 |
BR00116992E | A549 24-hour Compound Plate 2 | 21504 |
BR00116993E | A549 24-hour Compound Plate 3 | 27648 |
BR00116994E | A549 24-hour Compound Plate 4 | 27648 |
BR00116991F | A549 24-hour Compound Plate 1 | 21504 |
BR00116992F | A549 24-hour Compound Plate 2 | 21504 |
BR00116993F | A549 24-hour Compound Plate 3 | 27648 |
BR00116994F | A549 24-hour Compound Plate 4 | 27648 |
The folder for each 384-well plate typically contains images from nine sites for each well (for some wells 7, 8 or 16 sites were imaged).
The (x,y) coordinates of sites are available in the Metadata_PositionX and Metadata_PositionY columns of the load_data.csv.gz files in the load_data_csv folder.
There are eight images per site (five from the fluorescent channels and three brightfield images).
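The image counts in the tables above can be checked against this layout. The following is a minimal sketch of the arithmetic, assuming every well on a plate was imaged at the same number of sites; the function names are hypothetical and not part of this repo.

```python
WELLS_PER_PLATE = 384   # 16 rows x 24 columns
IMAGES_PER_SITE = 8     # 5 fluorescent channels + 3 brightfield images


def expected_image_count(sites_per_well: int, wells: int = WELLS_PER_PLATE) -> int:
    """Image files expected for a plate with a uniform number of sites per well."""
    return wells * sites_per_well * IMAGES_PER_SITE


def missing_images(observed: int, sites_per_well: int) -> int:
    """Difference between the expected and the observed number of image files."""
    return expected_image_count(sites_per_well) - observed


print(expected_image_count(9))    # 27648
print(expected_image_count(16))   # 49152
print(missing_images(27560, 9))   # 88
```

A count of 27648 is therefore consistent with nine sites per well, and 49152 with 16 sites per well. Counts slightly below these values suggest that a few sites or channels are missing for that plate.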
The names of the image files follow the naming convention - rXXcXXfXXp01-chXXsk1fk1fl1.tiff where
- rXX is the row number of the well that was imaged. rXX ranges from r01 to r16.
- cXX is the column number of the well that was imaged. cXX ranges from c01 to c24.
- fXX corresponds to the site that was imaged. fXX ranges from f01 to f16.
- chXX corresponds to the fluorescent channels imaged. chXX ranges from ch01 to ch08.
  - ch01 - Alexa 647
  - ch02 - Alexa 568
  - ch03 - Alexa 488 long
  - ch04 - Alexa 488
  - ch05 - Hoechst 33342
  - ch06-8 - three brightfield z planes.
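The naming convention above can be parsed with a regular expression. This is a minimal sketch, not code from this repo: `parse_image_name` and `ImageName` are hypothetical names, and the pattern accepts one- or two-digit channel numbers because the example file name in the download step uses `ch1` rather than `ch01`.

```python
import re
from typing import NamedTuple

# Assumption: channel numbers may be written with one or two digits.
_PATTERN = re.compile(
    r"^r(?P<row>\d{2})c(?P<col>\d{2})f(?P<site>\d{2})p01"
    r"-ch(?P<channel>\d{1,2})sk1fk1fl1\.tiff$"
)

# Channel numbers and stains as listed in the naming convention.
CHANNEL_NAMES = {
    1: "Alexa 647",
    2: "Alexa 568",
    3: "Alexa 488 long",
    4: "Alexa 488",
    5: "Hoechst 33342",
    6: "Brightfield",
    7: "Brightfield",
    8: "Brightfield",
}


class ImageName(NamedTuple):
    row: int
    column: int
    site: int
    channel: int
    channel_name: str


def parse_image_name(filename: str) -> ImageName:
    """Split an image file name into well row, well column, site and channel."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected image file name: {filename!r}")
    channel = int(match["channel"])
    if channel not in CHANNEL_NAMES:
        raise ValueError(f"unknown channel {channel} in {filename!r}")
    return ImageName(
        row=int(match["row"]),
        column=int(match["col"]),
        site=int(match["site"]),
        channel=channel,
        channel_name=CHANNEL_NAMES[channel],
    )


print(parse_image_name("r01c01f01p01-ch1sk1fk1fl1.tiff"))
# ImageName(row=1, column=1, site=1, channel=1, channel_name='Alexa 647')
```

The parser only checks the pattern and the channel number. It does not verify that the row, column and site fall within the ranges listed above.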
Cell bounding boxes and segmentation masks have not been provided.
Plate map and Metadata are available in the metadata/ folder and also from https://github.com/jump-cellpainting/JUMP-Target.
Use the CellProfiler pipelines in pipelines/2020_11_04_CPJUMP1 and follow the instructions in the profiling handbook up until chapter 5.3 to generate the well-level aggregated CellProfiler profiles from the cell images.
Follow the README.md to extract features from a pretrained neural network using DeepProfiler.
Pycytominer adds metadata from metadata/moa to the well-level aggregated profiles, normalizes the profiles to the whole plate and to the negative controls separately, and filters out invariant and redundant features.
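To illustrate the difference between normalizing to the whole plate and normalizing to the negative controls, here is a minimal sketch of a robust (median and MAD based) normalization of a single feature. It is not pycytominer's implementation: the function name, the `epsilon` guard and the 1.4826 scale factor are assumptions made for this example.

```python
from statistics import median

# Assumption: scale the MAD so that it is comparable to a standard
# deviation for normally distributed data.
MAD_SCALE = 1.4826


def robust_mad_normalize(values, reference, epsilon=1e-18):
    """Center `values` on the median of `reference` and divide by its scaled MAD.

    `reference` is every well on the plate for whole-plate normalization,
    or only the negative control wells for negcon normalization.
    """
    center = median(reference)
    mad = MAD_SCALE * median(abs(v - center) for v in reference)
    return [(v - center) / (mad + epsilon) for v in values]


# One feature measured in five wells; the first three are negative controls.
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
negcon = feature[:3]

whole_plate = robust_mad_normalize(feature, reference=feature)
to_negcon = robust_mad_normalize(feature, reference=negcon)
```

The two calls differ only in the reference population, which is why the same plate yields both a whole-plate normalized file and a negative-control normalized file.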
To reproduce the profiles, install Miniconda, then clone this repo, download the files and activate the conda environment with the commands
git clone https://github.com/jump-cellpainting/neurips-cpjump1
cd neurips-cpjump1
git lfs pull
git submodule update --init --recursive
conda env create --force --file environment.yml
conda activate profiling
Then run the pycytominer workflow with the command
./run.sh
This creates the profiles in the profiles/ folder for all the plates in each batch. The folder for each plate contains the following files
File name | Description |
---|---|
<plate_ID>.csv.gz | Aggregated profiles |
<plate_ID>_augmented.csv | Metadata annotated profiles |
<plate_ID>_normalized.csv.gz | MAD robustized to whole plate profiles |
<plate_ID>_normalized_negcon.csv.gz | MAD robustized to negative control profiles |
<plate_ID>_normalized_feature_select_plate.csv.gz | Feature selected normalized to whole plate profiles |
<plate_ID>_normalized_feature_select_negcon_plate.csv.gz | Feature selected normalized to negative control profiles |
The benchmark scripts compute Percent Replicating, which is a measure of signature strength, and Percent Matching across modalities, which is a measure of how well the chemical and genetic perturbation profiles match. These metrics are calculated using the Feature selected normalized to negative control profiles (well-level profiles).
In the case of features extracted using DeepProfiler, the annotated features are spherized and Percent Replicating is calculated on these profiles.
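As a rough illustration of Percent Replicating, the sketch below assumes one common formulation: a perturbation counts as replicating when the median pairwise Pearson correlation of its replicate profiles exceeds the 95th percentile of a null distribution built from correlations between profiles of different perturbations. This definition, the function names and the use of all non-replicate pairs as the null are assumptions of the example; the notebooks in benchmark/ may differ in the details.

```python
import math
from itertools import combinations
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def nearest_rank_percentile(sorted_values, q):
    """q-th percentile of an ascending list, using the nearest-rank method."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def percent_replicating(profiles, q=95):
    """profiles maps a perturbation name to its replicate feature vectors.

    Requires at least two perturbations and two replicates per perturbation.
    """
    # Signal: median pairwise correlation among replicates of each perturbation.
    scores = {
        name: median(pearson(a, b) for a, b in combinations(reps, 2))
        for name, reps in profiles.items()
    }
    # Null: correlations between profiles of different perturbations.
    null = sorted(
        pearson(a, b)
        for (_, reps_1), (_, reps_2) in combinations(profiles.items(), 2)
        for a in reps_1
        for b in reps_2
    )
    threshold = nearest_rank_percentile(null, q)
    hits = sum(score > threshold for score in scores.values())
    return 100.0 * hits / len(scores)


toy_profiles = {
    "A": [[1, 2, 3, 4], [1.1, 2.1, 2.9, 4.2]],   # consistent replicates
    "B": [[4, 3, 2, 1], [4.2, 2.9, 2.1, 1.0]],   # consistent replicates
    "C": [[1, -1, 1, -1], [-1, 1, -1, 1]],       # inconsistent replicates
}
print(f"{percent_replicating(toy_profiles):.1f}")  # 66.7
```

In this toy example, two of the three perturbations have replicate correlations above the null threshold, so the score is about 66.7%.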
To run the benchmark script, create and activate the conda environment in benchmark/
conda env create --force --file benchmark/environment.yml
conda activate benchmark
Then run the Jupyter notebooks (benchmark/0.percent_matching.ipynb and benchmark/1.percent_matching_across_modalities.ipynb) to create the figures in benchmark/figures/ and the tables in benchmark/README.md.
The following is the description of contents of the relevant folders in this repo.
- benchmark - contains the notebooks for reproducing the benchmark scores and figures
- config_files - contains the config files required for processing the profiles with pycytominer
- datasplit - contains the recommended data splits
- deep_profiles - contains the config files, functions and instructions for extracting DeepProfiler features
- example_images - contains single-site, all channel images from ten example wells
- load_data_csv - contains file location and other image metadata for each plate in all batches
- metadata - contains the perturbation metadata and plate maps
- pipelines - contains the CellProfiler pipelines for cell segmentation and feature extraction
- profiles - contains both CellProfiler and DeepProfiler extracted features for all batches
- profiling-recipe - contains the scripts for running the pycytominer pipeline for processing profiles
- visualization - contains notebooks for generating plate map and clinical phase status visualization figures
- environment.yml - conda environment for running the pycytominer pipeline
- run.sh - runs the pycytominer pipeline for processing profiles
- maintenance_plan.md - contains our maintenance plan for this dataset
We have provided our maintenance plan in maintenance_plan.md.
For segmentation and feature extraction, each plate of images took on average 30 minutes to process using a fleet of 200 m4.xlarge spot instances (800 vCPUs), which cost approximately $10 per plate. Aggregation into mean profiles takes 12-18 hours, though it can be parallelized on a single large machine, at a total cost of <$1 per plate. For profile processing with pycytominer, each plate took under two minutes on a local machine (Intel Core i9 with 16 GB memory).
DeepProfiler took around 8 hours to extract features from ~280,000 images on a p3.2xlarge instance with a single Tesla V100-SXM2 GPU. Note that cell locations were precomputed with the CellProfiler segmentation pipeline.
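The figures above can be turned into rough per-unit estimates. This is a back-of-the-envelope sketch using only the approximate numbers quoted in this section; the derived values are estimates, not measurements, and the variable names are hypothetical.

```python
# CellProfiler segmentation and feature extraction (approximate figures).
INSTANCES = 200
VCPUS_PER_INSTANCE = 4        # 200 instances x 4 vCPUs = 800 vCPUs
HOURS_PER_PLATE = 0.5         # 30 minutes on average
COST_PER_PLATE_USD = 10.0     # approximate

total_vcpus = INSTANCES * VCPUS_PER_INSTANCE
instance_hours_per_plate = INSTANCES * HOURS_PER_PLATE
usd_per_instance_hour = COST_PER_PLATE_USD / instance_hours_per_plate

# DeepProfiler feature extraction (approximate figures).
DEEPPROFILER_IMAGES = 280_000
DEEPPROFILER_HOURS = 8

images_per_second = DEEPPROFILER_IMAGES / (DEEPPROFILER_HOURS * 3600)

print(total_vcpus)                   # 800
print(instance_hours_per_plate)      # 100.0
print(round(usd_per_instance_hour, 2))  # 0.1
print(round(images_per_second, 1))   # 9.7
```

Under these assumptions, a plate uses about 100 instance-hours at roughly $0.10 per instance-hour, and DeepProfiler processes on the order of ten images per second.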
We use a dual license in this repository. We license the source code as BSD 3-Clause, and license the data, results, and figures as CC0 1.0.