
profiling-recipe's Issues

Create dvc for files in the gct folder

Currently, only the profiles (*.csv.gz) in the profiles folder are tracked with DVC, while the files in the gct folder are still stored in Git LFS. This should be changed so that the files in the gct folder are also included when creating the .dvc files.
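A rough sketch of what that could look like, assuming DVC is already initialized in this repository (the per-file loop is illustrative; running dvc add on the whole gct folder would also work):

import subprocess
from pathlib import Path

# Track every file in the gct folder with DVC instead of Git LFS.
for gct_file in sorted(Path("gct").rglob("*")):
    if gct_file.is_file():
        # 'dvc add' writes a <name>.dvc pointer file and moves the data into the DVC cache.
        subprocess.run(["dvc", "add", str(gct_file)], check=True)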

Include aggregation in recipe step

Currently in the JUMP recipe, the first step in the image-based profiling pipeline (aggregation) is being performed by cytominer, instead of pycytominer.

To streamline the recipe, the aggregation step should be ported to pycytominer.

Once cytomining/pycytominer#111 is merged, this step will be simple to port.
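For reference, a minimal sketch of what the ported step might look like with pycytominer's aggregate function (file paths are illustrative, and the exact call should be checked against the pinned pycytominer version):

import pandas as pd
import pycytominer

# Hypothetical single-cell table; in practice this would come from the per-plate backend
# (e.g. a SQLite file) rather than a CSV.
single_cells = pd.read_csv("single_cells.csv.gz")

# Collapse single-cell rows into one profile per well using the median.
well_profiles = pycytominer.aggregate(
    population_df=single_cells,
    strata=["Metadata_Plate", "Metadata_Well"],
    features="infer",
    operation="median",
)
well_profiles.to_csv("plate1.csv.gz", index=False)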

Move from linear execution strategy to modular block design

Currently, the recipe (as defined in jump-cellpainting#14) is linear, with each step progressing sequentially. For example, the normalization step happens before the feature selection step.

A use case came up in the JUMP project in which we want to apply different normalization steps to the same input file and then apply the same feature selection step to both normalization outputs.

This process is akin to a block design, in which each pipeline step is performed if and only if a "block" is added to the yaml config file.
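A minimal sketch of such a dispatch loop, assuming a hypothetical config layout in which each top-level key is a block (the real config keys and handler functions would be defined by the recipe):

import yaml

def normalize(block):          # placeholder handlers; the real ones would call pycytominer
    print("normalize:", block)

def feature_select(block):
    print("feature select:", block)

HANDLERS = {"normalize": normalize, "feature_select": feature_select}

with open("config.yml") as fh:
    pipeline = yaml.safe_load(fh)

# A step runs if and only if its block appears in the config file.
for block_name, block in pipeline.items():
    handler = HANDLERS.get(block_name)
    if handler is not None:
        handler(block)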

A couple of implications of this enhancement:

  • We will need to perform a substantial refactor to introduce this change.
  • We should explore adding the execution steps to a workflow language.
  • We will need to add functionality to specify the input file in each block.
  • We should explore adding dask to the mix to enable task-graph parallelization.

Create per-batch summary CSV and GCT files in the recipe

For the types and sizes of batches image analysts typically work with, these are the output files we rely on most at the end. Essentially, we just need all the per-plate files concatenated and then pycytominer's write_gct applied. This could be done either always or as an optional step in the config (summarize? stack?).
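A minimal sketch of that summarize/stack step, assuming the usual per-plate layout under profiles/<batch>/<plate>/ and that pycytominer.cyto_utils.write_gct accepts a profiles DataFrame and an output path (worth double-checking against the pinned version):

import glob
import pandas as pd
from pycytominer.cyto_utils import write_gct

# Stack all per-plate profiles for one batch (paths and file suffix are illustrative).
plate_files = sorted(glob.glob("profiles/BATCH1/*/*_normalized_feature_select.csv.gz"))
batch_profiles = pd.concat([pd.read_csv(f) for f in plate_files], ignore_index=True)

# Per-batch summary CSV plus a GCT for tools like Morpheus.
batch_profiles.to_csv("gct/BATCH1_normalized_feature_select.csv.gz", index=False)
write_gct(batch_profiles, "gct/BATCH1_normalized_feature_select.gct")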

Recipe wish list

Since the profiling-recipe was finalized for JUMP, the number of people who have interacted with the recipe has dramatically increased (JUMP members and the image analysts). Given that the recipe was repeatedly written and rewritten to satisfy the needs of the JUMP pilot experiments, I am surprised that it is robust and has, so far, not failed catastrophically. That doesn't mean the code is perfect. It needs a lot of work, particularly in

  • code documentation
  • coding consistency, so that it doesn't look like frankencode

I will work on tidying up the code so that it is easier for others to read and contribute to the codebase.

Apart from the above, other changes also need to be made to the recipe, because there have been several feature requests both from before and after the version for JUMP was frozen. These feature requests range from requiring minor changes to the recipe to requiring major changes to both the recipe and pycytominer.

I have listed all the feature requests below, with some comments and a score for how easy or difficult it will be to implement each (1 is easy and requires the least amount of time; 5 is difficult and requires the most amount of time).

Add feature analysis
Difficulty: 3

  • Shantanu has written a script to visualize how different categories of features vary for perturbations and DMSO on a given plate. The script is written in R and will need to be rewritten in Python.

Sample images
Difficulty: 3

  • Shantanu has written a script to generate thumbnail montages of perturbations. Creating a montage for each well in a plate while running the workflow is valuable, as it will help us answer our most-asked question - what do the cells look like?

Calculate Replicate correlation and Percent Replicating
Difficulty: 2

  • We currently calculate the correlation between every pair of wells during the quality control > heatmap step of the recipe. In order to calculate replicate correlation and Percent Replicating, the recipe would need to know which metadata column identifies the replicates. This could be added to the quality_control step (see the sketch below).

Beth mentioned this in #29
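A rough sketch of the calculation, assuming the replicate-identifying column is Metadata_broad_sample (that name is an assumption; it would be supplied via the quality_control block) and skipping the usual filtering of control wells:

import numpy as np
import pandas as pd

profiles = pd.read_csv("profiles/BATCH1/plate1/plate1_normalized_feature_select.csv.gz")
replicate_col = "Metadata_broad_sample"  # assumption: this column identifies replicates
features = [c for c in profiles.columns if not c.startswith("Metadata_")]

# Well-by-well Pearson correlation matrix.
corr = profiles[features].T.corr(method="pearson")

def median_replicate_corr(index):
    # Median of the upper triangle of the correlation sub-matrix for these wells.
    sub = corr.loc[index, index].to_numpy()
    return np.median(sub[np.triu_indices(len(index), k=1)])

replicate_corr = {
    sample: median_replicate_corr(group.index)
    for sample, group in profiles.groupby(replicate_col)
    if len(group) > 1
}

# Percent Replicating: fraction of perturbations whose replicate correlation exceeds the
# 95th percentile of a null distribution (crudely, all well pairs here; a proper null would
# use only non-replicate pairs).
null = corr.to_numpy()[np.triu_indices(len(corr), k=1)]
percent_replicating = np.mean([v > np.percentile(null, 95) for v in replicate_corr.values()])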

Rename quality_control
Difficulty: 1

  • This block was named so because we wanted to generate plots that would tell us if there is something wrong with a plate. Now that I want to include other plots and analyses, I think we should rename this block. I don't have a new name, but I will think of one once we decide all the new plots and analyses that will go under this block.

Adding second order features
Difficulty: 5

  • We wanted to include this for JUMP but, due to the lack of time, we decided not to. IIUC, this would require changes to pycytominer, which means it won't be easy to implement.

Adding dispersion order features
Difficulty: 2

  • We wanted to include this for JUMP, but we ran out of time

Adding replicate correlation feature selection as an option
Difficulty: 4

Adding git, aws cli to the conda environment
Difficulty: 1
Given that all the packages are installed using conda, it makes sense to add git and the AWS CLI via conda as well. This is particularly helpful on EC2 instances that ship with outdated versions of git.

Set summary -> perform false
Difficulty: 1
I realized that not all scopes generate load_data_csv files, which are required for the summary file to be generated. Hence, the default option for perform in the config file should be false.

Automatically create the plate information in the config files
Difficulty: 4
One of the most cumbersome tasks while running the recipe is to specify the names of all the batches and plates in the config file. If a user wants to run all the plates using a single config file, this information is already available in the barcode_platemap.csv file and could be added automatically to the config file. But the tricky part is making the script generic such that it can satisfy most users' needs.
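A minimal sketch, assuming the barcode_platemap.csv follows the profiling-handbook layout with Assay_Plate_Barcode and Plate_Map_Name columns (the emitted keys below are illustrative, not the recipe's actual config schema):

import pandas as pd
import yaml

batch = "BATCH1"  # illustrative batch name
barcode_platemap = pd.read_csv(f"metadata/platemaps/{batch}/barcode_platemap.csv")

# Group plates by the plate map they were stamped from.
plates_by_platemap = (
    barcode_platemap.groupby("Plate_Map_Name")["Assay_Plate_Barcode"].apply(list).to_dict()
)

config_fragment = {
    batch: [
        {"platemap": platemap, "plates": plates}
        for platemap, plates in plates_by_platemap.items()
    ]
}
print(yaml.dump(config_fragment, default_flow_style=False))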

Replace find and rsync steps
Difficulty: 2
Currently, these two steps are necessary when aggregation is performed outside the recipe. These two steps compress the well-level aggregated profiles and then copy them to the profiles folder. This could be implemented in the recipe, saving the user the hassle of running these two steps.
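A minimal sketch of folding those two steps into the recipe (the source location and file pattern are assumptions; they would mirror whatever the find step currently matches):

import gzip
import shutil
from pathlib import Path

backend_dir = Path("~/work/backend/BATCH1").expanduser()  # illustrative source location
profiles_dir = Path("profiles/BATCH1")

# Compress each well-level aggregated CSV and drop it into the profiles folder,
# mirroring what the find + rsync steps do today.
for csv_file in backend_dir.rglob("*.csv"):
    dest = profiles_dir / csv_file.parent.name / (csv_file.name + ".gz")
    dest.parent.mkdir(parents=True, exist_ok=True)
    with open(csv_file, "rb") as src, gzip.open(dest, "wb") as out:
        shutil.copyfileobj(src, out)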

Remove features = infer from normalize and feature select
Difficulty: 1
This option exists so that the user can input their own list of features instead of letting pycytominer infer the features from the profiles. I don't see any user entering thousands of feature names in the config file. I will remove this option from the config file, and if users want to use their own set of features, they can call pycytominer from their own script.

Profile annotation at the plate level
Difficulty: 3
When multiple types of plates (treatment, control, etc.) are run in a single batch, each type of plate would need a different config file, because the external_metadata file is specified once for all the plates in a config file. Allowing the user to set the name of the external_metadata file at the plate level will allow them to run multiple types of plates in multiple batches using the same config file.

Setting site name at the plate/batch level
Difficulty: 3
Currently, all the fields of view to aggregate have to be the same for all plates in a config file. If set at the plate level, then multiple plates with different FoVs to aggregate can be run together.

Setting input and output file names for each block
Difficulty: 4
The operations (aggregation, annotation, normalization and feature selection) run in a predetermined order because the output of one operation is the input of the next. By specifying the names of the input and output files, it will be possible to run the operations in any order. Until we move over to a more powerful WDL-like setup for running the workflow, this would provide the functionality of running operations in any order. This would also allow adding new annotations to profiles without rerunning normalization and feature selection, which was requested by Anne.

Greg mentions this in #13

Here is some more context for the linear execution strategy - #11

Make the normalize block more general
Difficulty: 3
Currently, each type of normalization (whole plate and negcon) requires a different type of block (normalize and normalize_negcon). If the input and output names are allowed to be specified, then only a single type of block will be needed. The block will have a parameter to specify which type of normalization to perform (whole plate or negcon); see the sketch below.
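A sketch of what the single block could call under the hood, assuming pycytominer.normalize's samples argument accepts a pandas query string and the platemap provides Metadata_control_type (as in the JUMP instructions below):

import pandas as pd
import pycytominer

profiles = pd.read_csv("profiles/BATCH1/plate1/plate1_annotated.csv.gz")

# Whole-plate normalization: every well contributes to the location/scale estimates.
whole_plate = pycytominer.normalize(
    profiles=profiles,
    features="infer",
    samples="all",
    method="mad_robustize",
)

# Negcon normalization: only the negative-control wells define the estimates.
negcon = pycytominer.normalize(
    profiles=profiles,
    features="infer",
    samples="Metadata_control_type == 'negcon'",
    method="mad_robustize",
)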

Combining collate.py and the recipe
Difficulty: 5
The recipe will greatly benefit from merging with collate.py because it could use collate.py's ability to run in parallel. collate.py might also benefit from the recipe because it will have a home :) and the user will be able to interact with it using the config file instead of the command line. Also, the recipe and collate.py call the same pycytominer function, so it makes sense for the two to be merged.

Create directories as part of a recipe step
Difficulty: 1
#8

Include consensus building as a recipe step
Difficulty: 2
#14

Now that you have made it through the list, there are a few questions that need to be answered:

  • Who will implement these features? I can implement some of them, but I won't have the time to implement all of them.
  • Is anyone interested in contributing to the recipe?
  • Are there other feature requests? I have captured all of Nasim's suggestions, but the other image analysts may have other feature requests.

Provide instructions on how to run the JUMP-specific profiling recipe

We should provide instructions on how to run the JUMP-specific profiling recipe. This might need to be done elsewhere, not in this repo.

It's possible that what we have in "Instructions provided to JUMP partners" below has all the information we need, but it still needs to be written up somewhere.

We can then have someone test-drive the instructions. The goal will be to recreate everything downstream of Chapter 5.3 in the profiling handbook, for a Cell Painting Gallery dataset (e.g., one plate of cpg0012).

Some notes on our past discussions are below.


Shantanu Singh
2 months ago
Thanks for clarifying – I hadn’t looked carefully at the difference between the JUMP instructions (Step 3 onwards; this is copied below at the end of this issue comment) and the recipe README.
So it looks like the only differences are

  1. JUMP instructions specify which commit of the recipe to use, but the recipe README does not specify it (in fact, even if we wanted to do so, the right place to do it would be in the profiling-template README, right?)
  2. JUMP instructions specify what changes to make to the config.yml, but the recipe README only says that changes to config.yml can be made (“All the necessary changes to the config file must be made before the pipeline can be run.“)
    Neither are differences in the workflow per se – the first specifies which commit to use, the second specifies what config to use.
    Is that correct? If so, we are all set there.

Two more questions

  1. is it correct that the recipe – in its current form – does not attempt to do anything upstream of annotate? It’s pretty clear in the README (“Downloading the data”) but I wanted to doublecheck.
  2. both the recipe README and the handbook specify step-by-step instructions for running the recipe; would it be sensible to have the instructions in only one of the two locations? If so, where should they live? I think the handbook lends itself more naturally.

Niranj Chandrasekaran
2 months ago

JUMP instructions specify which commit of the recipe to use, but the recipe README does not specify it (in fact, even if we wanted to do so, the right place to do it would be in the profiling-template README).

That’s right. Currently the instructions say that we add the recipe as a submodule. We should just add another line to check out a particular commit if we want everyone to use a specific version of the recipe.

JUMP instructions specify what changes to make to the config.yml, but the recipe README only says that changes to config.yml can be made (“All the necessary changes to the config file must be made before the pipeline can be run.“)

I guess the instructions will be dataset/project specific. Perhaps a more general version of the JUMP instructions can be added to the recipe README as recommended changes to config.yml.

is it correct that the recipe – in its current form – does not attempt to do anything upstream of annotate? It’s pretty clear in the README (“Downloading the data”) but I wanted to doublecheck.

The recipe can aggregate, given a sqlite file. But it doesn’t do it in parallel, which we may want to do for large projects. But for small projects with only a few plates, the recipe can be used for aggregation (for example - https://github.com/jump-cellpainting/pilot-cpjump1-fov-data)

both the recipe README and the handbook specify step-by-step instructions for running the recipe; would it be sensible to have the instructions in only one of the two locations? If so, where should they live? I think the handbook lends itself more naturally.

I initially wanted the handbook to be the go-to location for running the recipe. But there was a lot of documentation for the recipe that didn’t fit well in the handbook. Hence I started writing the README. But I also think that the handbook should be the location for the step-by-step instructions, and the README can remain the place for additional information about the recipe.


Instructions provided to JUMP partners

(Copied from https://github.com/jump-cellpainting/develop-computational-pipeline/issues/52#issue-1026707736)

Step 1: Image to single cell csv

Using the pipelines and the instructions until step 5.2 of the profiling handbook, generate the single cell csv files.

Step 2: Single cell csv to well level aggregated profiles

In step 5.3, before running collate_cmd.py, check out the commit that contains the updated collate.py code. In the first code block, after cd pycytominer, do this:

git pull
git checkout jump
git checkout b4d32d39534c949ad5165f0b98b79537c2a7ca58

Notes:

  1. When running collate_cmd.py, use the flag --image-feature-categories="Intensity,ImageQuality,Granularity,Texture,Count,Threshold"
  2. If you have previously run collate_cmd.py, please rerun it so that the whole-image features in the .sqlite file are added to the aggregated profiles. Don't forget to
    • use the --image-feature-categories flag mentioned in 1.
    • use the --aggregate-only flag to skip re-creating the .sqlite files
    • optionally, use the --overwrite flag (and do not use the --aggregate-only flag) if you do want to recreate the .sqlite files; there is typically no need to do so unless something went wrong when the .sqlite files were created

Note: The above instructions were updated after the discussion here (Broad internal slack) and here and here.

Step 3: Aggregated profiles to annotated, normalized, feature selected profiles

After running collate.py, switch over to the instructions in the profiling-recipe repo. These instructions are similar to the ones in the workflow demo but with additional details.

Before running the profiling pipeline, issue the following commands to make sure the correct version of the profiling-recipe is used by everyone

cd profiling-recipe
git pull
git checkout 745d7627213acd9d376172e5ac716a5d4c07fbec
cd ~/work/projects/${PROJECT_NAME}/workspace/software/${DATA}/

Note: we had previously specified using 3584ceca79e83065c72a7acb021d360026ace2a2. This still works. However, we now specify using 745d7627213acd9d376172e5ac716a5d4c07fbec (because we are now able to specify the MAD Robustize fudge factor in pycytominer).

Then the following changes should be made to the config.yml for generating the profiles.

  1. Give the pipeline a name.
  2. Aggregation: Set perform under aggregate to false as aggregation will be performed while running collate_cmd.py
  3. Annotation: Provide the name of the external metadata file, if it exists.
    • If you do have an external_metadata.csv, set perform under external to true and specify the name of the external metadata file.
    • If you do not have an external metadata file because all the metadata is already included in your platemap files, then set perform under external to false.
  4. In the platemap.txt file, use the JCP ID as the perturbation identifier. Name this column jump-identifier. If perform under external is set to true, make sure to set merge_column under annotate to jump-identifier.
  5. Normalization and feature selection: Since the code needs to know which wells contain controls, add two columns to your platemap.txt file (see the sketch after this list):
    (1) pert_type, which should say trt for treatment wells and control for control wells
    (2) control_type, which should be left empty for treatment wells, and say negcon for DMSO wells and poscon for positive control wells.
  6. Provide batch names and plate names.
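A minimal sketch of adding the pert_type and control_type columns from step 5 (the platemap file name and the rule for spotting DMSO wells are assumptions; substitute whatever identifies your controls, and mark poscon wells the same way):

import pandas as pd

platemap = pd.read_csv("platemap.txt", sep="\t")

# Default everything to a treatment well, then mark the DMSO wells.
platemap["pert_type"] = "trt"
platemap["control_type"] = ""
is_dmso = platemap["jump-identifier"].eq("DMSO")  # assumption: DMSO wells are identifiable this way
platemap.loc[is_dmso, "pert_type"] = "control"
platemap.loc[is_dmso, "control_type"] = "negcon"

platemap.to_csv("platemap.txt", sep="\t", index=False)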

General instructions:

  • To keep the config files easy to read, it is ok to have a different config file for each batch.
  • The metadata and plate map files for Target-2 plates are available at https://github.com/jump-cellpainting/JUMP-Target
  • You may want to have (though not strictly necessary) a different config file for the Target-2 plates and your assay plates, within each batch.

Migrate from subprocess to workflow language

Explore moving from a subprocess-based pipeline step execution strategy to a workflow language.

Some examples include:

How we will make this decision:

  • We will outline our needs for the immediate and future pipeline (some needs discussed in #11 )
  • We will determine which option best serves our needs and is most compatible with our tech stack
  • We will carve out sufficient, dedicated time to make an informed decision

Conda env out of date

Trying to create the environment today, I got the following error. Since other conda-forge packages are not listed as missing, I'm guessing it's that those versions of python and pip are no longer supported. Will make a PR with a working version, but wondering if there may be reasons for these particular pins. May also be due to being on an M1.

conda env create --force --file environment.yml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound: 
  - plotly::python-kaleido=0.0.3
  - conda-forge::pip=19.2.2
  - conda-forge::python=3.7.1

Read .csv or .csv.gz

It would be nice if, any time the recipe loads a .csv.gz (e.g. loading load_data.csv's), it instead checked whether the file is present in either .csv or .csv.gz format and loaded whichever exists, so that it doesn't crash if you miss compressing your input files.

Unless I'm missing some reason why this isn't a good idea?
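A minimal sketch of the fallback, assuming the caller passes the path without its extension (pandas picks up gzip compression from the file name):

from pathlib import Path
import pandas as pd

def read_csv_or_gz(base_path):
    """Read base_path + '.csv.gz' or '.csv', whichever exists."""
    for suffix in (".csv.gz", ".csv"):
        candidate = Path(str(base_path) + suffix)
        if candidate.exists():
            return pd.read_csv(candidate)
    raise FileNotFoundError(f"Neither {base_path}.csv nor {base_path}.csv.gz was found")

# e.g. load_data = read_csv_or_gz("load_data_csv/BATCH1/plate1/load_data")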
