
Pycytominer

Data processing for image-based profiling


Pycytominer is a suite of common functions used to process high-dimensional readouts from high-throughput cell experiments. The tool is most often used to process data through the following pipeline:

[Figure: the pycytominer pipeline. Images flow from feature extraction and are processed through a series of steps.]

Image data flow from a microscope to cell segmentation and feature extraction tools (e.g. CellProfiler or DeepProfiler). From here, additional single-cell processing tools curate the single-cell readouts into a form manageable for pycytominer input. For CellProfiler, we use cytominer-database or CytoTable. For DeepProfiler, we include single-cell processing tools in pycytominer.cyto_utils.

From the single cell output, pycytominer performs five steps using a simple API (described below), before passing along data to cytominer-eval for quality and perturbation strength evaluation.

Installation

You can install pycytominer via pip:

pip install pycytominer

or conda:

conda install -c conda-forge pycytominer

Frameworks

Pycytominer is primarily built on top of pandas, also using aspects of SQLAlchemy, sklearn, and pyarrow.

Pycytominer currently supports parquet and compressed text file (e.g. .csv.gz) i/o.
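
As a small illustration of the compressed text format mentioned above, the sketch below writes and reads a .csv.gz file using pandas alone. It does not call pycytominer, and the column names are made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy profile table; the column names are illustrative only.
profiles = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A02"],
        "Cells_AreaShape_Area": [101.5, 98.25],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "profiles.csv.gz"

    # pandas infers gzip compression from the ".gz" suffix on write and on read.
    profiles.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

    # gzip files start with the magic bytes 0x1f 0x8b.
    is_gzip = path.read_bytes()[:2] == b"\x1f\x8b"
```

The same file path could then be handed to a pycytominer function in place of a DataFrame, since the functions accept either.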

API

Pycytominer has five major processing functions:

  1. Aggregate - Average single-cell profiles based on metadata information (most often "well").
  2. Annotate - Append metadata (most often from the platemap file) to the feature profile
  3. Normalize - Transform input feature data into consistent distributions
  4. Feature select - Exclude non-informative or redundant features
  5. Consensus - Average aggregated profiles by replicates to form a "consensus signature"
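
To make these steps concrete, here is a conceptual sketch of aggregation, annotation, feature selection, and consensus written in plain pandas. It is not pycytominer's implementation: the toy data, the column names, the use of the median as the summary statistic, and the "drop features that do not vary" rule are all assumptions chosen for illustration. Normalization is left out here.

```python
import pandas as pd

# Toy single-cell table: two wells with two cells each (names are illustrative).
single_cells = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A01", "A02", "A02"],
        "Cells_AreaShape_Area": [100.0, 120.0, 80.0, 90.0],
        "Nuclei_Intensity_Mean": [0.5, 0.75, 0.25, 0.5],
        "Image_Blank": [0.0, 0.0, 0.0, 0.0],
    }
)

# Toy platemap: both wells are replicates of the same hypothetical treatment.
platemap = pd.DataFrame(
    {"Metadata_Well": ["A01", "A02"], "Metadata_treatment": ["drug_x", "drug_x"]}
)

features = ["Cells_AreaShape_Area", "Nuclei_Intensity_Mean", "Image_Blank"]

# 1. Aggregate: summarize single cells into one profile per well (median assumed).
aggregated = single_cells.groupby("Metadata_Well", as_index=False)[features].median()

# 2. Annotate: attach platemap metadata to each well-level profile.
annotated = platemap.merge(aggregated, on="Metadata_Well", how="inner")

# 4. Feature select: keep only features that vary across wells (assumed rule).
informative = [name for name in features if annotated[name].nunique() > 1]

# 5. Consensus: summarize replicate wells into one signature per treatment.
consensus = annotated.groupby("Metadata_treatment", as_index=False)[informative].median()
```

In this toy example the constant Image_Blank feature is dropped, and the two replicate wells collapse into a single drug_x row.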

The API is consistent for each of these functions:

# Each function takes as input a pandas DataFrame or file path
# and transforms the input data based on the provided options and methods
df = function(
    profiles_or_path,
    features,
    samples,
    method,
    output_file,
    additional_options...
)

Each processing function has unique arguments. See our documentation for more details.

Usage

The default way to use pycytominer is within Python scripts.

# Real world example
import pandas as pd
import pycytominer

commit = "da8ae6a3bc103346095d61b4ee02f08fc85a5d98"
url = f"https://media.githubusercontent.com/media/broadinstitute/lincs-cell-painting/{commit}/profiles/2016_04_01_a549_48hr_batch1/SQ00014812/SQ00014812_augmented.csv.gz"

df = pd.read_csv(url)

normalized_df = pycytominer.normalize(
    profiles=df,
    method="standardize",
    samples="Metadata_broad_sample == 'DMSO'"
)
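
The samples argument in the example above selects the rows used to fit the normalization, which is then applied to every row. A rough pandas-only sketch of that idea follows. It is an approximation, not pycytominer's code: the toy data, the hypothetical compound ID, and the choice of the population standard deviation (ddof=0) are assumptions.

```python
import pandas as pd

# Toy well-level profiles; "BRD-0001" is a hypothetical compound ID.
profiles = pd.DataFrame(
    {
        "Metadata_broad_sample": ["DMSO", "DMSO", "BRD-0001"],
        "Cells_AreaShape_Area": [90.0, 110.0, 130.0],
    }
)
features = ["Cells_AreaShape_Area"]

# Fit the center and scale on the control (DMSO) rows only.
controls = profiles.query("Metadata_broad_sample == 'DMSO'")
center = controls[features].mean()
scale = controls[features].std(ddof=0)  # population std, an assumption

# Apply the control-derived transform to every row, controls and treatments alike.
normalized = profiles.copy()
normalized[features] = (profiles[features] - center) / scale
```

After this transform the DMSO wells are centered on zero, and the treated well is expressed in units of control standard deviations.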

Pipeline orchestration

Pycytominer is a collection of different functions with no explicit link between steps. However, some options exist to use pycytominer within a pipeline framework.

| Project | Format | Environment | pycytominer usage |
| --- | --- | --- | --- |
| Profiling-recipe | yaml | agnostic | full pipeline support |
| CellProfiler-on-Terra | WDL | google cloud / Terra | single-cell aggregation |
| CytoSnake | snakemake | agnostic | full pipeline support |

A separate project called AuSPICES offers pipeline support up to image feature extraction.

Other functionality

Pycytominer was written with a goal of processing any high-throughput profiling data. However, the initial use case was developed for processing image-based profiling experiments specifically and, more specifically than that, image-based profiling readouts from CellProfiler measurements of Cell Painting data.

Therefore, we have included some custom tools in pycytominer/cyto_utils that provide other functionality.

Note that pycytominer.cyto_utils.cells.SingleCells() contains code to interact with single-cell SQLite files, which are output from CellProfiler. Processing capabilities for SQLite files depend on SQLite file size and your available computational resources (e.g. memory and cores).

CellProfiler CSV collation

If you run your images on a cluster without a MySQL or similar large database set up, you will likely end up with many folders from the different cluster runs (often one per well or one per site), each containing an Image.csv, Nuclei.csv, etc. To look at full plates, we therefore first need to collate all of these CSVs into a single file (currently SQLite) per plate. We currently do this with a library called cytominer-database.
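
The idea of collation can be sketched with the Python standard library alone. The snippet below is not how cytominer-database works internally. The collate_csvs helper, the folder layout, and the toy CSV contents are hypothetical, and it ignores details a real tool must handle, such as column types and ImageNumber values that repeat across runs.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def collate_csvs(run_dirs, sqlite_path, table_names=("Image", "Nuclei")):
    """Append each run folder's <table>.csv to a same-named table in one SQLite file."""
    conn = sqlite3.connect(sqlite_path)
    try:
        for run_dir in run_dirs:
            for table in table_names:
                csv_path = Path(run_dir) / f"{table}.csv"
                if not csv_path.exists():
                    continue
                with open(csv_path, newline="") as handle:
                    reader = csv.reader(handle)
                    header = next(reader)
                    columns = ", ".join(f'"{name}"' for name in header)
                    placeholders = ", ".join("?" for _ in header)
                    # Untyped columns keep the sketch short; values are stored as text.
                    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
                    conn.executemany(
                        f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})',
                        reader,
                    )
        conn.commit()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    wells = ("A01", "A02")

    # Simulate one output folder per well, as a cluster run might produce.
    for well in wells:
        run_dir = root / well
        run_dir.mkdir()
        (run_dir / "Image.csv").write_text(f"ImageNumber,Metadata_Well\n1,{well}\n")
        (run_dir / "Nuclei.csv").write_text("ImageNumber,ObjectNumber\n1,1\n1,2\n")

    db_path = root / "plate.sqlite"
    collate_csvs([root / well for well in wells], db_path)

    # Count the collated rows in each plate-level table.
    conn = sqlite3.connect(db_path)
    image_rows = conn.execute('SELECT COUNT(*) FROM "Image"').fetchone()[0]
    nuclei_rows = conn.execute('SELECT COUNT(*) FROM "Nuclei"').fetchone()[0]
    conn.close()
```

Each per-well folder contributes its rows to a single plate-level table, which is the form a downstream single-cell processing step can read in one pass.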

If you want to perform this data collation inside pycytominer using the cyto_utils function collate (and/or you want to be able to run the tests and have them all pass!), you will need cytominer-database==0.3.4; this will change your installation commands slightly:

# Example for general case commit:
pip install "pycytominer[collate]"

# Example for specific commit:
pip install "pycytominer[collate] @ git+https://github.com/cytomining/pycytominer@77d93a3a551a438799a97ba57d49b19de0a293ab"

If using pycytominer in a conda environment, in order to run collate.py, you will also want to make sure to add cytominer-database=0.3.4 to your list of dependencies.

Creating a cell locations lookup table

The CellLocation class offers a convenient way to augment a LoadData file with the X,Y locations of cells in each image. The location information is obtained from a single-cell SQLite file.

To use this functionality, you will need to modify your installation command, similar to above:

# Example for general case commit:
pip install "pycytominer[cell_locations]"

Example using this functionality:

metadata_input="s3://cellpainting-gallery/test-cpg0016-jump/source_4/workspace/load_data_csv/2021_08_23_Batch12/BR00126114/test_BR00126114_load_data_with_illum.parquet"
single_cell_input="s3://cellpainting-gallery/test-cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/test_BR00126114.sqlite"
# Use ${HOME} rather than "~": the shell does not expand a tilde inside double quotes
augmented_metadata_output="${HOME}/Desktop/load_data_with_illum_and_cell_location_subset.parquet"

python \
    -m pycytominer.cyto_utils.cell_locations_cmd \
    --metadata_input "${metadata_input}" \
    --single_cell_input "${single_cell_input}" \
    --augmented_metadata_output "${augmented_metadata_output}" \
    add_cell_location

# Check the output

python -c "import pandas as pd; print(pd.read_parquet('${augmented_metadata_output}').head())"

# It should look something like this (depends on the width of your terminal):

#   Metadata_Plate Metadata_Well Metadata_Site  ...                                   PathName_OrigRNA ImageNumber                                        CellCenters
# 0     BR00126114           A01             1  ...  s3://cellpainting-gallery/cpg0016-jump/source_...           1  [{'Nuclei_Location_Center_X': 943.512129380054...
# 1     BR00126114           A01             2  ...  s3://cellpainting-gallery/cpg0016-jump/source_...           2  [{'Nuclei_Location_Center_X': 29.9516027655562...
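
The shape of the CellCenters column shown above (one list of coordinate records per image) can be reproduced on toy data with pandas. This is only a sketch of the resulting structure, not the CellLocation implementation. The toy tables and the Nuclei_Location_Center_Y column name are assumptions.

```python
import pandas as pd

# Toy LoadData-style table with one row per image.
load_data = pd.DataFrame(
    {"ImageNumber": [1, 2], "Metadata_Well": ["A01", "A01"]}
)

# Toy single-cell table with one row per nucleus (coordinates are made up).
nuclei = pd.DataFrame(
    {
        "ImageNumber": [1, 1, 2],
        "Nuclei_Location_Center_X": [943.5, 12.0, 29.5],
        "Nuclei_Location_Center_Y": [182.0, 40.5, 310.25],
    }
)
coord_cols = ["Nuclei_Location_Center_X", "Nuclei_Location_Center_Y"]

# Collect the coordinates of all nuclei in each image as a list of records.
cell_centers = {
    image_number: group[coord_cols].to_dict(orient="records")
    for image_number, group in nuclei.groupby("ImageNumber")
}

# Attach the per-image lists to the image-level table; images without cells get [].
augmented = load_data.copy()
augmented["CellCenters"] = [
    cell_centers.get(image_number, []) for image_number in augmented["ImageNumber"]
]
```

Each row of the augmented table then carries every cell location for its image, which keeps the table at one row per image.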

Generating a GCT file for morpheus

The software morpheus enables profile visualization in the form of interactive heatmaps. Pycytominer can convert profiles into a .gct file for drag-and-drop input into morpheus.

# Real world example
import pandas as pd
import pycytominer

commit = "da8ae6a3bc103346095d61b4ee02f08fc85a5d98"
plate = "SQ00014812"
url = f"https://media.githubusercontent.com/media/broadinstitute/lincs-cell-painting/{commit}/profiles/2016_04_01_a549_48hr_batch1/{plate}/{plate}_normalized_feature_select.csv.gz"

df = pd.read_csv(url)
output_file = f"{plate}.gct"

pycytominer.cyto_utils.write_gct(
    profiles=df,
    output_file=output_file
)

