
cytominer-eval's Introduction

Cytominer-eval: Evaluating quality of perturbation profiles


Cytominer-eval contains functions to calculate quality metrics for perturbation profiling experiments.

Installation

Cytominer-eval can be installed via pip:

pip install cytominer-eval

Since the project is under active development, you can also install the latest functionality via a GitHub commit hash:

# Example
pip install git+https://github.com/cytomining/cytominer-eval@f7f5b293da54d870e8ba86bacf7dbc874bb79565

Usage

Cytominer-eval uses a simple API for all evaluation metrics.

# Working example
import pandas as pd
from cytominer_eval import evaluate

# Load Data
commit = "6f9d350badd0a18b6c1a76171813aaf9a52f8d9f"
url = f"https://github.com/cytomining/cytominer-eval/raw/{commit}/cytominer_eval/example_data/compound/SQ00015054_normalized_feature_select.csv.gz"

df = pd.read_csv(url)

# Define important function arguments
meta_features = df.columns[df.columns.str.startswith("Metadata_")]
features = df.drop(meta_features, axis="columns").columns.tolist()
replicate_groups = ["Metadata_broad_sample", "Metadata_mg_per_ml"]

# Evaluate profile quality
evaluate(
    profiles=df,
    features=features,
    meta_features=meta_features,
    replicate_groups=replicate_groups,
    replicate_reproducibility_return_median_cor=False,
    operation="replicate_reproducibility",
)

Metrics

Currently, six metric operations are supported:

  1. Replicate reproducibility
  2. Precision/recall
  3. mp-value
  4. Grit
  5. Enrichment
  6. Hit@k

Demos

For more in depth tutorials, see https://github.com/cytomining/cytominer-eval/tree/master/demos.

cytominer-eval's People

Contributors: alxndrkalinin, gwaybio, hillsbury, koalive, michaelbornholdt


cytominer-eval's Issues

Option to avoid recalculating the melted similarity matrix

Currently, the only option in evaluate() is to provide a pandas dataframe with metadata and profile features. The function then calculates and melts the similarity matrix.

We should provide an option to evaluate precomputed melted similarity matrices to speed up computation.

Define precision and recall @ k

(Stubs for now, so we can add this documentation to code later)

Precision@k can be computed in many different contexts:

  1. For each replicate well, the fraction of replicates of the same perturbation among the k most-similar wells.
  2. For each guide, the fraction of guides targeting the same gene among the k most-similar guides.
  3. For each compound (in the context of MOA), the fraction of compounds in the same MOA class among the k most-similar compounds.

Recall@k is defined similarly, e.g., the fraction of all replicates of the same perturbation that are among the k most-similar wells.
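All of these reduce to the same computation once each profile's neighbors are ranked by similarity. Here is a minimal sketch, assuming a hypothetical neighbors DataFrame sorted by descending similarity, with a boolean same_group column marking true matches:

import pandas as pd

def precision_recall_at_k(neighbors: pd.DataFrame, k: int):
    # neighbors: all other profiles, sorted by descending similarity,
    # with a boolean `same_group` column marking true matches
    top_k = neighbors.head(k)
    n_hits = top_k["same_group"].sum()
    precision = n_hits / k                            # fraction of the top k that match
    recall = n_hits / neighbors["same_group"].sum()   # fraction of all matches retrieved
    return precision, recall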

R API for "plyr-ing" similarity matrices

@gwaygenomics just a heads up that the code I've been using to test things out here (private) and here is evolving toward a consistent API. I'm wary of creating a package (it would be yet another bifurcation in our efforts), so I am actively avoiding doing so (as you can see, a single R file!) while still following R package principles (docstrings, explicit namespaces).

Let's make a decision on what to do once we've written up the benchmark paper. Definitely no R package for now; we can see where we are post-submission and figure out plans then. Ideally, I'd just switch over to Python and life would be simpler :D But that would really slow me down!

Define percent strong

(Stubs for now, so we can add this documentation to code later)

Percent strong is reported in two ways. We should distinguish between these ways of reporting (they are similar but not the same):

  1. The fraction of replicate pairs that are more similar to each other than 95% of non-replicate pairs.
  2. The fraction of replicate sets that are more coherent than 95% of non-replicate sets of the same size. The coherence of a set is defined as the median correlation among elements of the set.

The second version can be a bit confusing, so here is an example:

  • Consider an experiment that has 100 perturbations, each performed in 5 replicates.
  • For each perturbation, we compute the pairwise correlations among its 5 replicates (the replicate set) and report the median value. This median replicate correlation is the coherence of the perturbation's replicates.
  • We then compute the coherence of non-replicate sets of the same size: we sample 5 random wells in the experiment such that no two are replicates of the same perturbation, compute the pairwise correlations among the 5 wells (the non-replicate set), and report the median value. We do this several times to build a null distribution.
  • Finally, we report the fraction of perturbations whose coherence is greater than the 95th percentile of the null. This is the percent strong.
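Both versions share the same final step: compare observed values against the 95th percentile of a null distribution. A minimal sketch, assuming the per-pair correlations (version 1) or per-set coherences (version 2) have already been collected into arrays:

import numpy as np

def percent_strong(observed, null, quantile=0.95):
    # Fraction of observed values exceeding the 95th percentile of the null
    threshold = np.quantile(null, quantile)
    return np.mean(observed > threshold)

# Version 1: observed = replicate-pair correlations, null = non-replicate-pair correlations
# Version 2: observed = replicate-set coherences, null = sampled non-replicate-set coherences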

Remaining tests for the new precision/recall

Here is a list of ideas for tests to add to the precision/recall function. This is needed because #63 focuses only on results, not on good tests.

  • Use a small example, as Florian wrote in #62, to make sure that the math is correct.
  • Make sure that the output df has the right shape depending on the number of groupby_cols.
  • Add checks that make sure the groupby columns are available, i.e., they exist in the df.

A test about replicate groups rather than groupby_cols:

  • Make sure that if the replicate groups span a unique mapping over the input df, then all values are 0. This is the case because no connection will have a true label.

`precision_recall()` counts within-group connections twice but between-group connections only once

When reading the source code for cytominer_eval.operations.precision_recall(), I noticed that the similarity_melted_df variable counts each replicate pair twice, e.g., A1 --> A2 and A2 --> A1.
This becomes a problem because only the first replicate_group_col in lines 49-52 is subsequently used for grouping:

49 replicate_group_cols = [
50    "{x}{suf}".format(x=x, suf=pair_ids[list(pair_ids)[0]]["suffix"])  # [0] keeps only the first of two grouping columns
51    for x in replicate_groups
52 ]

In the next step, each group is passed to calculate_precision_recall():

59 precision_recall_df_at_k = similarity_melted_df.groupby(
60     replicate_group_cols
61 ).apply(lambda x: calculate_precision_recall(x, k=k_))
62 precision_recall_df = precision_recall_df.append(precision_recall_df_at_k)

The effect is that all samples from within a group are counted twice, while samples from outside the group are counted only once, because the groupby filters out one direction.

Let me clarify this with an example. Consider 5 samples, the first 3 from group 'A' and the last 2 from group 'B', with greater within-group than between-group correlations:

[image: pairwise correlation matrix of the 5 example samples]

Then what calculate_precision_recall will see is this:
[image: the melted similarity dataframe passed to calculate_precision_recall]

For example, one can see that the sample_pair_a column has a row for A1-->A2 and one for A2-->A1, but only one for A1-->B1; B1-->A1 is missing because of the way the melted data frame is generated and the grouping is performed. One can also see that the similarity metrics for within-group connections appear in duplicate.

Accordingly, the outcome for precision and recall at k=4 is the following:
[image: precision and recall results at k=4]
Precision: all 4 closest connections are within-group for A, but only 2 for group B.
Recall: 4 of 6 connections found for A, but both of B's 2 connections found.

In summary, the computations are not entirely correct, especially for smaller groups. Also consider that with odd values of k, only one of the two connections of a symmetric pair is used.

Admittedly, this is a bit mind-boggling. I recommend using a debugger if you want to trace all the steps in detail by yourself.

Proposed solution: count each pair only once when creating the melted data frame.
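A hedged sketch of that proposal: keep only the upper triangle of the similarity matrix before melting, so each unordered pair appears exactly once. Here sim is assumed to be a square pandas DataFrame of pairwise similarities; this is not the actual metric_melt implementation:

import numpy as np
import pandas as pd

def melt_upper_triangle(sim: pd.DataFrame) -> pd.DataFrame:
    # Mask the diagonal and lower triangle so A1-->A2 is kept but A2-->A1 is not
    mask = np.triu(np.ones(sim.shape, dtype=bool), k=1)
    # stack() drops the masked (NaN) entries, leaving one row per unordered pair
    melted = sim.where(mask).stack().reset_index()
    melted.columns = ["pair_a", "pair_b", "similarity_metric"]
    return melted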

Define grit

(Stubs for now, so we can add this documentation to code later)

Grit can be computed in many different contexts:

A replicate's grit is its average similarity to other replicates, z-scored against its similarity to negative control replicates.

A guide's grit is its average similarity to other guides targeting the same gene, z-scored against its similarity to negative control guides.

A compound's grit (in the context of MOA) is its average similarity to other compounds in the same MOA class, z-scored against its similarity to DMSO replicates.
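All three contexts share the same arithmetic. A minimal sketch, assuming group_sims holds a profile's similarities to its same-group profiles and control_sims its similarities to negative-control profiles:

import numpy as np

def grit(group_sims, control_sims):
    # z-score the mean same-group similarity against the control similarity distribution
    mu, sigma = np.mean(control_sims), np.std(control_sims)
    return (np.mean(group_sims) - mu) / sigma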

Add Precision at R to precision at k

This update should be done soon after #63, since it is fast to implement and important for my thesis.

I will simply add an option for precision at R, which sets k to the number of correct neighbors so that precision always lies between 0 and 1, no matter how many MOA replicates exist for a given class.
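Reusing the hypothetical precision_recall_at_k sketch from the precision@k definition above, the idea is just:

# Precision at R: set k to the number of true matches, so a perfect
# ranking always scores exactly 1.0 (names are illustrative)
R = int(neighbors["same_group"].sum())
precision_at_r, _ = precision_recall_at_k(neighbors, k=R)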

percent_strong outputs nan when no replicates are present

When no replicates are present, denom = 0 and we receive the following warning:

/home/ubuntu/efs/Periscope_Calico/workspace/software/Periscope_Calico/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/cytominer_eval/operations/percent_strong.py:50: RuntimeWarning: invalid value encountered in long_scalars
  ).sum() / denom

We should raise an AssertionError when denom = 0, with the error message "no replicate groups identified".

The offending lines are:

replicate_df = similarity_melted_df.query("group_replicate")
denom = replicate_df.shape[0]
percent_strong = (
    replicate_df.similarity_metric > non_replicate_quantile
).sum() / denom
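The guard could look like this (a sketch of the proposal, not the actual patch):

# Fail fast with an informative message instead of dividing by zero
assert denom != 0, "no replicate groups identified"
percent_strong = (
    replicate_df.similarity_metric > non_replicate_quantile
).sum() / denom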

Figure out why percent replicating is apparently insensitive to number of replicates

@gwaygenomics reported this in https://github.com/broadinstitute/lincs-profiling-complementarity

Additionally, because we had 5 replicates of Cell Painting and 3 replicates of L1000 per treatment, we performed a subsampling experiment with Cell Painting to match the number of L1000 replicates. We actually observed slightly increased percent replicating scores in the subsampled Cell Painting data with fewer replicates indicating that the two additional Cell Painting replicates did not artificially inflate percent replicating scores (Supplementary Figure 3).

[image: Supplementary Figure 3, percent replicating for full vs. subsampled Cell Painting data]

Running metric_melt several times

This is an optimization issue.
When running metrics such as precision/recall or enrichment, as in the demo notebook (https://github.com/cytomining/cytominer-eval/blob/master/demos/CellPainting_Demo.ipynb),
you call evaluate several times to calculate the metric at different values of p, e.g.:

import numpy as np

result = []
for p in np.arange(0.99, 0.96, -0.01):
    r = evaluate(
        profiles=df,
        features=features,
        meta_features=meta_features,
        replicate_groups=["Metadata_gene_name"],
        operation="enrichment",
        similarity_metric="pearson",
        enrichment_percentile=p,
    )
    result.append(r)

However, this calls the function cytominer_eval.transform.get_pairwise_metric several times (once per call of evaluate). This is unnecessary, since all the metrics can be derived from the same similarity_melted_df!

We need to adapt evaluate so that it calculates the pairwise similarities only once when called several times, or else change the demos to show that such sweeps should not go through evaluate.
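For example, the demo sweep above could compute the melted similarity matrix once and reuse it. The metric_melt arguments mirror the evaluate source quoted elsewhere in this tracker; the import paths and the enrichment operation's exact signature are assumptions:

import numpy as np
from cytominer_eval.transform import metric_melt    # import path assumed
from cytominer_eval.operations import enrichment    # signature assumed below

# Compute the expensive pairwise similarity melt exactly once
similarity_melted_df = metric_melt(
    df=df,
    features=features,
    metadata_features=meta_features,
    similarity_metric="pearson",
    eval_metric="enrichment",
)

# Sweep the percentile over the cached result instead of re-calling evaluate
result = [
    enrichment(
        similarity_melted_df=similarity_melted_df,   # argument name assumed
        replicate_groups=["Metadata_gene_name"],
        percentile=p,                                # argument name assumed
    )
    for p in np.arange(0.99, 0.96, -0.01)
]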

What are your thoughts, @gwaygenomics?

Skip calculating similarity matrix for mp-value

The mp_value function operates on the input profiles, not on the melted similarity dataframe.

elif operation == "mp_value":
    metric_result = mp_value(
        df=profiles,
        control_perts=grit_control_perts,
        replicate_id=replicate_groups,
        features=features,
        params=mp_value_params,
    )

We should skip this calculation when operation == "mp_value":

similarity_melted_df = metric_melt(
    df=profiles,
    features=features,
    metadata_features=meta_features,
    similarity_metric=similarity_metric,
    eval_metric=operation,
)
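The guard could be as simple as (a sketch):

# Only build the melted similarity matrix when the operation needs it
if operation != "mp_value":
    similarity_melted_df = metric_melt(
        df=profiles,
        features=features,
        metadata_features=meta_features,
        similarity_metric=similarity_metric,
        eval_metric=operation,
    )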

Upgrade to minimum pandas 1.2 and deprecate Python 3.5

At some point in the future, we should consider making the minimum pandas version >=1.2 in the requirements.txt file.

I debated performing this in #55 because of one silly deprecation in pandas.testing.assert_frame_equal (the rtol argument, see 03c09ee). However, I decided that this version upgrade and deprecation should probably be thought through in more detail.

One idea is to mirror the version requirements of pycytominer, since the two are likely to be used together often. We should also think about projects that might no longer be able to use cytominer-eval after this decision, although I think that should be a non-issue since those projects are probably pinned to a specific cytominer-eval hash.

calculate_precision_recall() fails when replicate_groups has just one member

In operations.py, calculate_precision_recall() is called on line 67. When replicate_groups has just one member, this fails. Before I investigate in more detail: should we return NA in such cases with a warning, or simply throw a more informative exception? At the moment, a division-by-zero error is raised without additional information.
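Either way, a guard up front would be more informative than the current division-by-zero error. A sketch (variable names are illustrative, not the actual implementation):

# Raise an informative error instead of a bare division-by-zero
n_members = group_df.shape[0]
if n_members < 2:
    raise ValueError(
        f"replicate group has only {n_members} member; "
        "precision/recall requires at least 2"
    )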

Further optimizations

Also, I'm wondering if you can optimize this (it probably doesn't need to change for this PR; it is likely beyond scope), but if you split out the two queries, is it faster?

e.g.

nongroup_replicate_query = replicate_truth_df.query("not group_replicate")
group_replicate_query = replicate_truth_df.query("group_replicate")

v11 = group_replicate_query.query("similarity_metric > @threshold")

and so on...

Optimize enrichment

Add https link for cytominer-eval installation

On the README front page, the link for the most recent cytominer-eval installation is given as

pip install git+git://github.com/cytomining/cytominer-eval@f7f5b293da54d870e8ba86bacf7dbc874bb79565

However, my company server only allows https access. For such cases, the link should be

pip install git+https://github.com/cytomining/cytominer-eval@f7f5b293da54d870e8ba86bacf7dbc874bb79565

Please add this to the README file and also include it in other similar installation instructions. :)

Issues with groupby columns

Soo,

for precision/recall and hit@k we now have the case that the input groupby columns determine by which columns the similarity df is sorted. This has an important impact on the results. If you, for example, sort by something that is not unique (i.e., not unique in the input df), then you will get internal connections in the sub-dataframe that you are grouping.

Let's say, for example, you have a df with samples at different dosages. If you then set groupby_columns = Metadata_broad_sample, you will sort into subgroups that have several connections within each other (all the different doses). Your precision will then show the weird effects that @FloHu described in #62, for example. Similarly, hit@k will give weird results because you are now looking at internal connections and not only the nearest neighbors of one sample.

Either we keep it all this way and make users aware of it, or we find some workaround. Maybe the solution is to not allow anything other than unique groupby_cols?
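If we go the strict route, a validation sketch might look like this (names illustrative):

import pandas as pd

# Reject groupby columns that do not uniquely identify each profile
def check_groupby_cols(df: pd.DataFrame, groupby_cols):
    if df.duplicated(subset=groupby_cols).any():
        raise ValueError(
            "groupby_cols do not uniquely identify profiles in the input df; "
            "internal within-group connections would distort precision and hit@k"
        )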

Adding comparison to mp-value

Hi there!
@gwaygenomics, as we discussed, I am looking into adding the multidimensional perturbation value (mp-value, DOI: 10.1177/1087057112469257) as another metric to which you could compare the grit score. This actually sounds like it solves #23 (the mp-value is basically the Mahalanobis distance to the negative controls in PCA-reduced space).
To clarify and make sure this would fit in this project, could you please confirm that these are the steps it would require?

  • In cytominer_eval.operations define a new operation mp_value.
  • In test-evaluate.py, add a method test_evaluate_mp_value.
  • In evaluate.py, add a case for operation == "mp_value".
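For reference, a rough sketch of the core computation described above, i.e. the Mahalanobis distance of a perturbation to the negative controls in PCA-reduced space (this omits the permutation step that turns distances into the actual mp-value p-values):

import numpy as np
from sklearn.decomposition import PCA

def mahalanobis_to_controls(profiles, control_profiles, n_components=2):
    # Fit PCA on treatment and control profiles together, then project both
    pca = PCA(n_components=n_components).fit(np.vstack([profiles, control_profiles]))
    treated = pca.transform(profiles)
    controls = pca.transform(control_profiles)
    # Mahalanobis distance of the treatment centroid to the control distribution
    cov_inv = np.linalg.pinv(np.cov(controls, rowvar=False))
    delta = treated.mean(axis=0) - controls.mean(axis=0)
    return float(np.sqrt(delta @ cov_inv @ delta))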

Cheers!
