
facebookresearch / cpa


The Compositional Perturbation Autoencoder (CPA) is a deep generative framework to learn effects of perturbations at the single-cell level. CPA performs OOD predictions of unseen combinations of drugs, learns interpretable embeddings, estimates dose-response curves, and provides uncertainty estimates.

License: MIT License


cpa's Introduction

CPA - Compositional Perturbation Autoencoder

This code is no longer maintained; please use the new implementation here.

What is CPA?

Screenshot

CPA is a framework to learn the effects of perturbations at the single-cell level. CPA encodes and learns phenotypic drug responses across different cell types, doses, and drug combinations. CPA enables:

  • Out-of-distribution predictions of unseen drug combinations at various doses and across different cell types.
  • Learning interpretable drug and cell-type latent spaces.
  • Estimating the dose-response curve for each perturbation and their combinations.
  • Assessing the uncertainty of the model's estimates.

Package Structure

The repository is centered around the cpa module:

  • cpa.train contains scripts to train the model.
  • cpa.api contains user-friendly scripts to interact with the model via scanpy.
  • cpa.plotting contains plotting functions.
  • cpa.model contains the modules of the CPA model.
  • cpa.data contains the data loader, which transforms an AnnData structure into a class compatible with the CPA model.

Additional files and folders:

  • datasets contains both versions of the data: raw and pre-processed.
  • preprocessing contains notebooks to reproduce the datasets pre-processing from raw data.

Usage

  • As a first step, download the contents of datasets/ and pretrained_models/ from this tarball.

To learn how to use this repository, check ./notebooks/demo.ipynb.

  • Note that the hyperparameters in demo.ipynb are set to defaults and might not work for new datasets.

Examples and Reproducibility

You can find more examples, hyperparameter tuning scripts, and reproducibility notebooks for the plots in the paper in the reproducibility repo.

Curation of your own data to train CPA

  • To prepare your data for training CPA, you need to add specific fields to the adata object and perform a data split. Examples of how to add the necessary fields for the datasets used in the paper can be found in the preprocessing/ folder.
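As a rough illustration of the kind of per-cell metadata involved, here is a hedged sketch built with pandas only. The column names `condition`, `dose_val`, and `cell_type` match the keys that appear in the pretrained models' args; the `control` flag and the `split` column name are assumptions for this toy example — consult the notebooks in preprocessing/ for the exact convention.

```python
import numpy as np
import pandas as pd

# Toy per-cell metadata table, mimicking what would live in adata.obs.
# Key names 'condition', 'dose_val', 'cell_type' follow the args dict of
# the pretrained models; 'control' and 'split' are assumed names.
rng = np.random.default_rng(0)
n = 100
obs = pd.DataFrame({
    "condition": rng.choice(["control", "drugA", "drugB"], size=n),
    "dose_val": rng.choice([0.1, 1.0], size=n),
    "cell_type": rng.choice(["A549", "K562"], size=n),
})
# Flag unperturbed cells so the loader can identify the control condition.
obs["control"] = (obs["condition"] == "control").astype(int)
# Random train/test/ood split (assumed field name 'split').
obs["split"] = rng.choice(["train", "test", "ood"], size=n, p=[0.8, 0.1, 0.1])
# In practice these columns are attached to an AnnData object: adata.obs = obs
```

The split should usually be stratified so that held-out ("ood") perturbations are excluded from training; the random split above is only a placeholder.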

Training a model

There are two ways to train a CPA model:

  • Using the command line, e.g.: python -m cpa.train --data datasets/GSM_new.h5ad --save_dir /tmp --max_epochs 1 --doser_type sigm
  • From jupyter notebook: example in ./notebooks/demo.ipynb

Documentation

Currently, you can access the documentation via the help function in IPython. For example:

from cpa.api import API

help(API)

from cpa.plotting import CPAVisuals

help(CPAVisuals)

A separate page with the documentation is coming soon.

Support and contribute

If you have a question or noticed a problem, you can post an issue.

Reference

Please cite the following publication if you find CPA useful in your research.

@article{lotfollahi2023predicting,
  title={Predicting cellular responses to complex perturbations in high-throughput screens},
  author={Lotfollahi, Mohammad and Klimovskaia Susmelj, Anna and De Donno, Carlo and Hetzel, Leon and Ji, Yuge and Ibarra, Ignacio L and Srivatsan, Sanjay R and Naghipourfar, Mohsen and Daza, Riza M and Martin, Beth and others},
  journal={Molecular Systems Biology},
  pages={e11517},
  year={2023}
}

The paper, Predicting cellular responses to complex perturbations in high-throughput screens, can be found [here](https://www.embopress.org/doi/full/10.15252/msb.202211517); a preprint is available [on bioRxiv](https://www.biorxiv.org/content/10.1101/2021.04.14.439903v2).

License

This source code is released under the MIT license, included here.

cpa's People

Contributors

cdedonno, klanita, m0hammadl, scottgigante-immunai


cpa's Issues

Issues regarding the model training process

Dear authors,

Thank you very much for sharing this repo. This is a very interesting algorithm and I think the paper is very helpful. Currently I am trying to train the model using your provided datasets; however, when I run the command line "python -m cpa.train --data datasets/GSM_new.h5ad --save_dir /tmp --max_epochs 1 --doser_type sigm", I always get the error "AssertionError: Covariate c is missing in the provided adata". If I follow your code in the notebook, I get the error "module 'cpa' has no attribute 'api'". Could it be that there is an updated version of the cpa module?

I am currently doing research on out-of-sample prediction problems for perturbation data. I think your method should be very helpful, and I am really looking forward to successfully running the algorithm. I hope to get your response regarding this issue; thank you very much for your time.

Best,
Hongxu

Different key names in the pretrained model used in trapnell.ipynb

Hi,

I downloaded the tar file and tried to run trapnell.ipynb.
The model in the notebook was not there: 'pretrained_models/sweep_sciplex3_old_reproduced_logsigm_model_seed=30_epoch=340.pt'

I picked another existing one: 'pretrained_models/sweep_sciplex3_prepared_logsigm/model_seed=30_epoch=120.pt'
But it gives me key error when running the next line:
model, datasets = prepare_cpa(args, state_dict=state)

    250     datasets = load_dataset_splits(
--> 251         args["data"],
    252         args["perturbation_key"],
    253         args["dose_key"],

KeyError: 'data'

It looks like the keys in this pretrained model are different from the one used in the notebook:
args
{'dataset_path': 'datasets/sciplex3_prepared.h5ad',
'perturbation_key': 'condition',
'dose_key': 'dose_val',
'cell_type_key': 'cell_type',
'loss_ae': 'gauss',
'doser_type': 'logsigm',
'seed': 30,
'hparams': '',
'max_epochs': 2000,
'patience': 20,
'checkpoint_freq': 20,
'save_dir': '/checkpoint/klanna/sweep_sciplex3_prepared_logsigm',
'sweep_seeds': 200}

I wonder if I used the wrong version of the pretrained model to reproduce this notebook?

Thanks,
Yun-Ching

Bug in sciplex3 preprocessing?

Hi there,

I've been trying to reproduce the training on sciplex but I get this error with a brand new clone, datasets and conda env:

$ python -m compert.train --dataset_path datasets/sciplex3_new.h5ad       --save_dir /tmp --max_epochs 1  --doser_type sigm

Traceback (most recent call last):
  File "/home/kcvc236/miniconda3/envs/CPAvanilla/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kcvc236/miniconda3/envs/CPAvanilla/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/kcvc236/CPAvanilla/CPA/compert/train.py", line 303, in <module>
    train_compert(parse_arguments())
  File "/home/kcvc236/CPAvanilla/CPA/compert/train.py", line 197, in train_compert
    autoencoder, datasets = prepare_compert(args)
  File "/home/kcvc236/CPAvanilla/CPA/compert/train.py", line 167, in prepare_compert
    datasets = load_dataset_splits(
  File "/home/kcvc236/CPAvanilla/CPA/compert/data.py", line 189, in load_dataset_splits
    "training": dataset.subset("train", "all"),
  File "/home/kcvc236/CPAvanilla/CPA/compert/data.py", line 129, in subset
    return SubDataset(self, idx)
  File "/home/kcvc236/CPAvanilla/CPA/compert/data.py", line 161, in __init__
    self.ctrl_name = dataset.ctrl_name[0]
IndexError: list index out of range

I have a strong suspicion that there is a problem in the preprocessing of sciplex:
https://github.com/facebookresearch/CPA/blob/main/preprocessing/sciplex3.ipynb

Cell #6 is probably causing the trouble by making it impossible for adata.obs.control to be anything other than 0, hence the error above.
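To see why an all-zero control column would produce that IndexError, here is a hypothetical pandas sketch (not the repo's actual code): the loader derives the control condition's name from the cells flagged as control, so an empty flag column leaves that list empty and indexing `[0]` fails. The "Vehicle" label is an assumption for this toy example.

```python
import pandas as pd

# Toy obs table; 'Vehicle' as the control label is an assumption.
obs = pd.DataFrame({"condition": ["Vehicle", "drugA", "drugA", "Vehicle"]})
obs["control"] = (obs["condition"] == "Vehicle").astype(int)

# The loader (conceptually) collects the names of control conditions;
# if obs["control"] were all zeros, ctrl_name would be [] and
# ctrl_name[0] would raise IndexError, as in the traceback above.
ctrl_name = obs.loc[obs["control"] == 1, "condition"].unique().tolist()
assert len(ctrl_name) > 0  # a nonzero mask avoids the empty-list failure
```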

Do you have a working version or fix you could share for this please?

Cheers

Question in the paper

Dear Authors,
Thank you for making the paper and code open source. It is very helpful.
Screenshot 2021-09-25 at 09 23 24

With respect to the image above: two steps are performed for adversarial learning, one where the ld and lc terms are positive and the next where they are negative. Why not use a gradient reversal layer before the perturbation and covariate discriminators instead of the two-step process, so that the loss can be back-propagated in a single forward and backward pass? Or is this just a design choice? I am just curious.

Am I missing something? Please let me know.

Thank you,
Megh

Request of 'sweep_sciplex3_old_reproduced_logsigm_model_seed=30_epoch=340.pt'

Hi,

To reproduce the result of sciplex3, I run notebooks/trapnell.ipynb. The notebook loads model from 'pretrained_models/sweep_sciplex3_old_reproduced_logsigm_model_seed=30_epoch=340.pt'. However, the tarball you provided (https://dl.fbaipublicfiles.com/dlp/cpa_binaries.tar) does not contain this file. I tried with 'pretrained_models/sciplex3_new/sweep_sciplex3_new_logsigm_model_seed=106_epoch=40.pt' and some other models provided in the tarball, but failed to load the model. Could you please provide a copy of 'sweep_sciplex3_old_reproduced_logsigm_model_seed=30_epoch=340.pt'? I guess this is the correct file to reproduce the result in the paper.

Thank you in advance!

one-hot encoder of control

Hi, I was wondering whether we should encode "control" as an all-zeros vector, instead of a regular one-hot vector like the other drugs.
I believe so, since by doing so we can simply use the learned z_basal to represent the embeddings of control cells. Let me know if I am wrong.
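To make the distinction concrete, here is a toy numpy sketch of the encoding being proposed (not CPA's actual code): drugs get one-hot rows, while "control" maps to the all-zeros vector, so a control cell contributes no drug embedding and is represented by z_basal alone.

```python
import numpy as np

# Hypothetical drug vocabulary for illustration only.
drugs = ["drugA", "drugB", "drugC"]

def encode(name):
    # One-hot for drugs; "control" deliberately stays all zeros, so its
    # latent contribution vanishes and z_basal alone represents the cell.
    vec = np.zeros(len(drugs))
    if name != "control":
        vec[drugs.index(name)] = 1.0
    return vec

assert encode("drugB").tolist() == [0.0, 1.0, 0.0]
assert encode("control").sum() == 0.0
```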

Also, is it possible to have your example data used in the notebooks?

raw/processed datasets available?

Hi, thank you for making this public!
I am interested in trying out the model but I didn't find the step for downloading the raw dataset. I am wondering if the preprocessed dataset will be available in the datasets folder sometime soon?

Thank you in advance!

datasets question

Dear Authors, thanks for your fantastic work. I followed your Jupyter notebooks and learned how to calculate the response, and I have read the code line by line. I found all of the h5ad files except kang.h5ad in the datasets. Can you provide the kang.h5ad file in the datasets?

Many Thanks

Hyperparameter Sweep

Hi,

Thanks for creating this package, it looks incredibly interesting! I wasn't able to find the sweep module for CPA for hyperparameter tuning for new datasets anywhere in the repo, has that been moved somewhere else?

Thanks!
Yan

dose response

Dear Authors, thanks for your fantastic work. I followed your Jupyter notebooks and learned how to calculate the response, and I have read the code line by line, so I know how the value of the response is obtained. However, I could not understand why the response is calculated this way. Could you please explain it from a bioinformatics perspective, since I have no background in this area?

Many Thanks
