Giter VIP home page Giter VIP logo

cello's Introduction

CellO: Cell Ontology-based classification   alt text

PyPI Version

About

CellO (Cell Ontology-based classification) is a Python package for performing cell type classification of human RNA-seq data. CellO makes hierarchical predictions against the Cell Ontology. These classifiers were trained on nearly all of the human primary cell, bulk RNA-seq data in the Sequence Read Archive.

For more details regarding the underlying method, see the paper: Bernstein, M.N., Ma, J., Gleicher, M., Dewey, C.N. (2020). CellO: Comprehensive and hierarchical cell type classification of human cellswith the Cell Ontology. iScience, 24(1), 101913.

There are two modes in which one can use CellO: within Python in conjunction with Scanpy, or with the command line.

Installation

To install CellO using Pip, run the following command:

pip install cello-classify

Running CellO from within Python

CellO's API interfaces with the Scanpy Python library and can integrate into a more general single-cell analysis pipeline. For an example on how to use CellO with Scanpy, please see the tutorial.

This tutorial can also be executed from a Google Colab notebook in the cloud: https://colab.research.google.com/drive/1lNvzrP4bFDkEe1XXKLnO8PZ83StuvyWW?usp=sharing.

Running CellO from the command line

CellO takes as input a gene expression matrix. CellO accepts data in multiple formats:

  • TSV: tab-separated value
  • CSV: comma-separated value
  • HDF5: a database in HDF5 format that includes three datasets: a dataset storing the expression matrix, a dataset storing the list of gene-names (i.e. rows), and a gene-set storing the list of cell ID's (i.e. columns)
  • 10x formatted directory: a directory in the 10x format including three files: matrix.mtx, genes.tsv, and barcodes.tsv

Given an output-prefix provided to CellO (this can include the path to the output), CellO outputs three tables formatted as tab-separated-value files:

  • <output_prefix>.probability.tsv: a NxM classification probability table of N cells and M cell types where element (i,j) is a probability value that describes CellO's confidence that cell i is of cell type j
  • <output_prefix>.binary.tsv: a NxM binary-decision matrix where element (i,j) is 1 if CellO predicts cell i to be of cell type j and is 0 otherwise.
  • <output_prefix>.most_specific.tsv: a table mapping each cell to the most-specific predicted cell
  • <output_prefix>.log: a directory that stores log files that store details of CellO's execution
  • <output_prefix>.log/genes_absent_from_training_set.tsv: if a new model is trained using the -t option, then this file will store the genes in CellO's training set that were not found in the input dataset
  • <output_prefix>.log/clustering.tsv: a TSV file mapping each cell to its assigned cluster. Note, that if pre-computed clusters are provided via the -p option, then this file will not be written.

Usage:

cello_predict [options] input_file

Options:
  -h, --help            show this help message and exit
  -a ALGO, --algo=ALGO  Hierarchical classification algorithm to apply
                        (default='IR'). Must be one of: 'IR' - Isotonic
                        regression, 'CLR' - cascaded logistic regression
  -d DATA_TYPE, --data_type=DATA_TYPE
                        Data type (required). Must be one of: 'TSV', 'CSV',
                        '10x', or 'HDF5'. Note: if 'HDF5' is used, then
                        arguments must be provided to the h5_cell_key,
                        h5_gene_key, and h5_expression_key parameters.
  -c H5_CELL_KEY, --h5_cell_key=H5_CELL_KEY
                        The key of the dataset within the input HDF5 file
                        specifying which dataset stores the cell ID's.  This
                        argument is only applicable if '-d HDF5' is used
  -g H5_GENE_KEY, --h5_gene_key=H5_GENE_KEY
                        The key of the dataset within the input HDF5 file
                        specifying which dataset stores the gene names/ID's.
                        This argument is only applicable if '-d HDF5' is used
  -e H5_EXPRESSION_KEY, --h5_expression_key=H5_EXPRESSION_KEY
                        The key of the dataset within the input HDF5 file
                        specifying which dataset stores the expression matrix.
                        This argument is only applicable if '-d HDF5' is used
  -r, --rows_cells      Use this flag if expression matrix is organized as
                        CELLS x GENES rather than GENES x CELLS. Not
                        applicable when '-d 10x' is used.
  -u UNITS, --units=UNITS
                        Units of expression. Must be one of: 'COUNTS', 'CPM',
                        'LOG1_CPM', 'TPM', 'LOG1_TPM'
  -s ASSAY, --assay=ASSAY
                        Sequencing assay. Must be one of: '3_PRIME',
                        'FULL_LENGTH'
  -t, --train_model     If the genes in the input matrix don't match what is
                        expected by the classifier, then train a classifier on
                        the input genes. The model will be saved to
                        <output_prefix>.model.dill
  -m MODEL, --model=MODEL
                        Path to pretrained model file.
  -l REMOVE_ANATOMICAL, --remove_anatomical=REMOVE_ANATOMICAL
                        A comma-separated list of terms ID's from the Uberon
                        Ontology specifying which tissues to use to filter
                        results. All cell types known to be resident to the
                        input tissues will be filtered from the results.
  -p PRE_CLUSTERING, --pre_clustering=PRE_CLUSTERING
                        A TSV file with pre-clustered cells. The first column
                        stores the cell names/ID's (i.e. the column names of
                        the input expression matrix) and the second column
                        stores integers referring to each cluster. The TSV
                        file should not have column names.
  -b, --ontology_term_ids
                        Use the less readable, but more rigorous Cell Ontology
                        term id's in output
  -o OUTPUT_PREFIX, --output_prefix=OUTPUT_PREFIX
                        Prefix for all output files. This prefix may contain a
                        path.

Notably, the input expression data's genes must match the genes expected by the trained classifier. If the genes match, then CellO will use a pre-trained classifier to classify the expression profiles (i.e. cells) in the input dataset.

To provide an example, here is how you would run CellO on a toy dataset stored in example_input/Zheng_PBMC_10x. This dataset is a set of 1,000 cells subsampled from the Zheng et al. (2017) dataset. To run CellO on this dataset, run this command:

cello_predict -d 10x -u COUNTS -s 3_PRIME example_input/Zheng_PBMC_10x -o test

Note that -o test specifies the all output files will have the prefix "test". The -d specifies the input format, -u specifies the units of the expression matrix, and -s specifies the assay-type. For a full list of available formats, units, assay-types, run:

cello_predict -h

Running CellO with a gene set that is incompatible with a pre-trained model

If the genes in the input file do not match the genes on which the model was trained, CellO can be told to train a classifier with only those genes included in the given input dataset by using the -t flag. The trained model will be saved to a file named <output_prefix>.model.dill where <output_prefix> is the output-prefix argument provided via the -o option. Training CellO usually takes under an hour.

For example, to train a model and run CellO on the file example_input/LX653_tumor.tsv, run the command:

cello_predict -u COUNTS -s 3_PRIME -t -o test example_input/LX653_tumor.tsv

Along with the classification results, this command will output a file test.model.dill.

Running CellO with a custom model

Training a model on a new gene set needs only to be done once (see previous section). For example, to run CellO on example_input/LX653_tumor.tsv using a specific model stored in a file, run:

cello_predict -u COUNTS -s 3_PRIME -m test.model.dill -o test example_input/LX653_tumor.tsv

Note that -m test.model.dill tells CellO to use the model computed in the previous example.

Quantifying reads with Kallisto to match CellO's pre-trained models

We provide a commandline tool for quantifying raw reads with Kallisto. Note that to run this script, Kallisto must be installed and available in your PATH environment variable. This script will output an expression profile that includes all of the genes that CellO is expecting and thus, expression profiles created with this script are automatically compatible with CellO.

This script requires a preprocessed kallisto reference. To download the pre-built Kallisto reference that is compatible with CellO, run the command:

bash download_kallisto_reference.sh

This command will download a directory called kallisto_reference in the current directory. To run Kallisto on a set of FASTQ files, run the command

cello_quantify_sample <comma_dilimited_fastq_files> <tmp_dir> -o <kallisto_output_file>

where <comma_delimited_fastq_files> is a comma-delimited set of FASTQ files containing all of the reads for a single RNA-seq sample and <tmp_dir> is the location where Kallisto will store it's output files. The file <kallisto_output_file> is a tab-separated-value table of the log(TPM+1) values that can be fed directly to CellO. To run CellO on this output file, run:

cell_predict -u LOG1_TPM -s FULL_LENGTH <kallisto_output_file> -o <cell_output_prefix>

Note that the above command assumes that the assay is a full-length assay (meaning reads can originate from the full-length of the transcript). If this is a 3-prime assay (reads originate from only the 3'-end of the transcript), the -s FULL_LENGTH should be replaced with -s 3_PRIME in the above command.

Trouble-shooting

If upon running pip install cello you receive an error installing Cython, that looks like:

ERROR: Command errored out with exit status 1:
     command: /scratch/cdewey/test_cello/CellO-master/cello_env/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-wo2dj5q7/quadprog/setup.py'"'"'; __file__='"'"'/tmp/pip-install-wo2dj5q7/quadprog/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
         cwd: /tmp/pip-install-wo2dj5q7/quadprog/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-wo2dj5q7/quadprog/setup.py", line 17, in <module>
        from Cython.Build import cythonize
    ModuleNotFoundError: No module named 'Cython'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

then you may try upgrading to the latest version of pip and Cython by running:

python -m pip install --upgrade pip
pip install --upgrade cython

cello's People

Contributors

ann-holmes avatar lustigeperson avatar mbernste avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

cello's Issues

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

`(317/317)
Training classifier for label CL:0001044...
Number of positive items: 2
Number of negative items: 4032
done.
Writing trained model to test.model.dill
Found CellO resources at 'cello/resources'.

InvalidIndexError Traceback (most recent call last)
in
6 rsrc_loc=cello_resource_loc,
7 out_prefix=model_prefix,
----> 8 log_dir=os.getcwd()
9 )

~/.conda/envs/genomics/lib/python3.7/site-packages/cello/scanpy_cello.py in cello(adata, clust_key, rsrc_loc, algo, out_prefix, model_file, log_dir, term_ids, remove_anatomical_subterms)
148 rsrc_loc=rsrc_loc,
149 log_dir=log_dir,
--> 150 remove_anatomical_subterms=remove_anatomical_subterms
151 )
152

~/.conda/envs/genomics/lib/python3.7/site-packages/cello/cello.py in predict(ad, mod, algo, clust_key, log_dir, remove_anatomical_subterms, rsrc_loc)
235 algo=algo,
236 clust_key=clust_key,
--> 237 log_dir=log_dir
238 )
239

~/.conda/envs/genomics/lib/python3.7/site-packages/cello/cello.py in _raw_probabilities(ad, mod, algo, clust_key, log_dir)
415 # Shuffle columns to be in accordance with model
416 features = mod.classifier.features
--> 417 ad = ad[:,features]
418
419 # Cluster

~/.conda/envs/genomics/lib/python3.7/site-packages/anndata/_core/anndata.py in getitem(self, index)
1106 def getitem(self, index: Index) -> "AnnData":
1107 """Returns a sliced view of the object."""
-> 1108 oidx, vidx = self._normalize_indices(index)
1109 return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
1110

~/.conda/envs/genomics/lib/python3.7/site-packages/anndata/_core/anndata.py in _normalize_indices(self, index)
1087
1088 def _normalize_indices(self, index: Optional[Index]) -> Tuple[slice, slice]:
-> 1089 return _normalize_indices(index, self.obs_names, self.var_names)
1090
1091 # TODO: this is not quite complete...

~/.conda/envs/genomics/lib/python3.7/site-packages/anndata/_core/index.py in _normalize_indices(index, names0, names1)
34 ax0, ax1 = unpack_index(index)
35 ax0 = _normalize_index(ax0, names0)
---> 36 ax1 = _normalize_index(ax1, names1)
37 return ax0, ax1
38

~/.conda/envs/genomics/lib/python3.7/site-packages/anndata/_core/index.py in _normalize_index(indexer, index)
96 return positions # np.ndarray[int]
97 else: # indexer should be string array
---> 98 positions = index.get_indexer(indexer)
99 if np.any(positions < 0):
100 not_found = indexer[positions < 0]

~/.conda/envs/genomics/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
3170 if not self.is_unique:
3171 raise InvalidIndexError(
-> 3172 "Reindexing only valid with uniquely valued Index objects"
3173 )
3174

InvalidIndexError: Reindexing only valid with uniquely valued Index objects`

no output files

Hi CellO team,

Training classifier for label CL:0001058... Number of positive items: 4 Number of negative items: 4249 (316/317) Training classifier for label CL:0000767... Number of positive items: 4 Number of negative items: 3696 (317/317) Training classifier for label CL:0001044... Number of positive items: 2 Number of negative items: 4032 done. Writing trained model to cello.model.dill Normalizing counts... done. Killed

Thanks for contributing to this tool. I wanted to apply to a pbmc dataset (GSE144744), but found that there was no output files except a model.drill (~300M), and a cell.log folder containing only genes_absent_from_training_set.tsv. Seems it stopped from the normalization step seen above.

Looking forward to your reply.

About preprocessor of PCA in pre-train

Hi CellO team,
I am very interesting in your CellO.
There are a few confusions : (1) Whether the train_X of the model uses PCA result pc1...pc3000 as a feature(Is the picture A or B?)
(2) If is B, How do you find the corresponding pc1...pc3000 in a single cell as a feature of L2LogisticRegression?
image

Could you provide a ftp link to the CellO resources?

Hi,
I wanna know more about the source you use to train your data to make a predecision on how to use this tool.
Maybe I don't find a proper citation to CellO, so I cannot know before I implement your tool. Could I have a look on the source (and/or process) you used to train the model?

Thanks!

KeyError: 'leiden' when setting the parameter clust_key

reproduce
use cello.scanpy_cello with clust_key set to existing obs column that is not 'leiden'

expected behavior
cello would use the set clustering key

observed behaviour
cello crashes with: KeyError: 'leiden'

identified bug
cello.py line 429
ad_clust = _combine_by_cluster(ad)

should be

ad_clust = _combine_by_cluster(ad, clust_key)

Erroneous assignments on 10k PBMC from 10x

Hi,

Thank you for developing this tool - I think this is a very neat and useful idea, especially given how Cell Ontology seems to be gaining traction. I included it in the Scanpy tutorial I'm putting together at our Institute that involves various things that are not covered by the vignettes on the Scanpy website. I was using 10k PBMC dataset I obtained from 10x Genomics

I noticed that a couple of fairly well-defined cell types are annotated with errors. Here's the "most specific cell type" annotation plot:

image

These, however, are the classical markers used for the PBMC sub-populations. FCGR3A is non-classical monocytes, FCER1A is myeloid DCs, and LILRA4 is plasmacytoid DCs:

image

You can see that the model mistook non-classical monocytes for dendritic cells, and annotated both subtypes of DCs as "mononuclear cell" which is probably not what's intended too.

I can provide the whole notebook if that would make it easier to diagnose the problem (but I did pretty much follow the tutorial, with the addition of .todense() transformation).

fix resource download URL

Hi, the download function invokes curl without following redirects, and the deweylab.biostat.wisc.edu server does not support http URLs. Please add an s to make it https. Thank you!

ValueError: Unable to determine gene collection

I receive the error message,

ValueError: Unable to determine gene collection. Please make sure the input dataset specifies either HUGO gene symbols or Entrez gene ID's.

W1_1.var.head()
gene_ids feature_types highly_variable means dispersions dispersions_norm n_cells mt rb n_cells_by_counts mean_counts pct_dropout_by_counts total_counts
Mrpl15 ENSMUSG00000033845 Gene Expression False 0.523846 1.410226 -0.317858 364 False False 364 0.125276 76.815287 196.682587
Lypla1 ENSMUSG00000025903 Gene Expression False 0.496954 1.360324 -0.604688 356 False False 356 0.117622 77.324841 184.666626
Tcea1 ENSMUSG00000033813 Gene Expression False 1.178549 1.267944 -0.668772 814 False False 814 0.410102 48.152866 643.860168
Atp6v1h ENSMUSG00000033793 Gene Expression False 0.555859 1.411481 -0.310645 389 False False 389 0.138000 75.222930 216.659286
Rb1cc1 ENSMUSG00000025907 Gene Expression False 1.298682 1.432109 0.011839 838 False False 838 0.449731 46.624204 706.077637

the gene_ids should be Entrez gene ids? I change the column name gene_ids to Entrez gene IDs or Gene stable ID or Entrez gene ids, but didn't work.
or what code should I use to map the gene ids to Ensembl BioMart (http://useast.ensembl.org/biomart) ?

Thanks
Ting

A little advice

Hi, thank you for developing this useful tool.
There are two things I want to report.

  1. I encountered an error and found that it was because adata.X were "stored elements in Compressed Sparse Row format" sometimes. And at this time, cello.scanpy_cello can not work. (It can be solved by adata.X.toarray())
  2. I found cello.scanpy_cello has a useful parameter remove_anatomical_subterms. And I think perhaps we want to limit the cell types (eg. lung, liver or breast) other than to filter out some other types in general?

About the PCA loadings matrix U

Hi CellO team,
Your ideal has attracted me deeply!
I want to ask you about the details.
(1)Is m in the picture the number of genes or the number of cell types?
(2)I think it should be the number of genes instead of the number of cell types mentioned above. If it is the number of cell types, U will not be able to multiply Xi, because the length of Xi is the number of genes, right?
(3)Is the U corresponding to each cell type the same?
image

Error when install CellO

Hello all,
I run:

pip install pygraphviz leidenalg scanpy cello-classify

And this was what I got:

 pygraphviz/graphviz_wrap.c:2711:10: fatal error: 'graphviz/cgraph.h' file not found
  #include "graphviz/cgraph.h"
           ^~~~~~~~~~~~~~~~~~~
  1 error generated.
  error: command '/usr/bin/clang' failed with exit code 1
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> pygraphviz

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

[notice] A new release of pip available: 22.2.2 -> 22.3
[notice] To update, run: python3.9 -m pip install --upgrade pip
(cello_env)

Would you please suggest how to fix this? Thank you so much!

sparsematrix toarray in cello.py

I encountered this error when I run scanoy_cello

File /mnt/data/hong/anaconda3/envs/scanpy/lib/python3.10/site-packages/cello/cello.py:454, in _aggregate_expression(X)
    448 def _aggregate_expression(X):
    449     """
    450     Given a matrix of log(TPM+1) where rows correspond to cells
    451     and columns correspond to genes, aggregate the counts
    452     to form a psuedo-bulk expression profile.
    453     """
--> 454     X = (np.exp(X)-1) / 1e6
    455     x_clust = np.sum(X, axis=0)
    456     sum_x_clust = float(sum(x_clust))

TypeError: loop of ufunc does not support argument 0 of type SparseCSRView which has no callable exp method

I added

from scipy.sparse import issparse
## in function _combine_by_cluster
    X_clust = ad[cells,:].copy().X
    X_clust = X_clust.toarray() if issparse(X_clust) else X_clust

in cello.py to deal with this error.

Any best solution?
Best,

Error after not finding pre-training models (ir.dill)

Hello everyone.

I just installed the latest cello-classify version with pip. Version 2.0.3. Installation went fine.

When I try to use it, I get an error.

I execute this cell in a JupyterLab notebook:

cello.scanpy_cello(adata0a,
                   clust_key='leiden',
                   rsrc_loc='/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130',
                   out_prefix='/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130/cello_model00',
                   log_dir='/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130')

and this is the error:

       Could not find the CellO resources directory called
        'resources' in '/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130'. Will download resources to current 
        directory.
        
Running command: curl http://deweylab.biostat.wisc.edu/cell_type_classification/resources_v2.0.0.tar.gz > /mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130/resources_v2.0.0.tar.gz
Running command: tar -C /mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130 -zxf resources_v2.0.0.tar.gz
Running command: rm /mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130/resources_v2.0.0.tar.gz
Checking if any pre-trained model is compatible with this input dataset...
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-45-2570c77cb08a> in <module>
      4 #sc.pp.neighbors(adata0a, n_neighbors=15)
      5 #sc.tl.leiden(adata0a, resolution=2.0)
----> 6 cello.scanpy_cello(adata0a,
      7                    clust_key='leiden',
      8                    rsrc_loc='/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130',

/usr/local/lib/python3.9/site-packages/cello/scanpy_cello.py in cello(adata, clust_key, rsrc_loc, algo, out_prefix, model_file, log_dir, term_ids, remove_anatomical_subterms)
    127     else:
    128         # Load or train a model
--> 129         mod = ce._retrieve_pretrained_model(adata, algo, rsrc_loc)
    130         if mod is None:
    131             mod = ce.train_model(

/usr/local/lib/python3.9/site-packages/cello/cello.py in _retrieve_pretrained_model(ad, algo, rsrc_loc)
    329                 model_fname
    330             )
--> 331             with open(model_f, 'rb') as f:
    332                 mod = dill.load(f)
    333             feats = mod.classifier.features

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/b0/compbio-ebs-01/igr/d0/220317_000000/130/resources/trained_models/ir.dill'

Could somebody offer some guidance?

Thank you.

Ivan

loop of ufunc does not support argument 0

Hello Matt,

When trying to replicate the scanpy protocol with my Raw data I get the following error after the classifier finishes:

TypeError: loop of ufunc does not support argument 0 of type SparseCSRView which has no callable exp method

Have you encountered this before?

Many Thanks,
Brian

support layer selection for anndata

Hi,

anndata has layers to save more information. I'm wondering if scanpy_cello can support the selection of layers to use. Thus we don't need to save separate object files to test.

Best,

resources are downloaded even if -f specified

running cello_predict with -f 'path to folder which contains resources' i get the message

Loading data from <>
Loaded data matrix with 4245 cells and 36601 genes.
Loading model from <>
Normalizing counts...
done.
Could not find the CellO resources directory called <cwd>

Looking at the code perhaps rsrc_loc is not being passed through to run_cello and cello.predict?

why every cluster is predicted as "oxygen accumulating cell (CL:0000329)"?

Hi,
I have a dataset from GSE123814 and I'm trying to re-analyze them using CellO. I normalize the data using the typical approach in your tutorial, and finish cello.scanpy_cello() without error. However, all the ~30 clustered were predicted as oxygen accumulating cell (CL:0000329).
I also tried several other datasets, and it seems that CellO only works for PBMC data, but not other tissue types.
Could you please comment on why this might happen and how to adjust the training for other tissue types?
Thanks,
Hurley

KeyError in scanpy_cello(); most specific cell type not created

Hi,
Thank you for developing this very useful tool!
We encountered an error while trying to classify our cells with a pre-trained model. The prediction itself seems to work and we get the binary and probability output for the ontology terms added to the adata object if we set term_ids=True, however the selection of the 'most specific cell type' fails with an KeyError and the conversion to readable terms does also not work (no output at all if term_ids=False).
Any ideas why this could happen and how to fix it?
Thanks in advance,
Laura and Philipp

The command:

cello.scanpy_cello(
    adata, 
    'clusters',
    cello_resource_loc, 
    model_file=f'{model_prefix}.model.dill',
    term_ids=True
)

Output:

Found CellO resources at '/data/home/EBgrant/scRNA_run1/analysis/Laura/CellO/resources'.

Variable names are not unique. To make them unique, call `.var_names_make_unique`.

Transforming with PCA...
done.
Making predictions for each classifier...
Running solver on item 1/19...
Running solver on item 2/19...
Running solver on item 3/19...
Running solver on item 4/19...
Running solver on item 5/19...
Running solver on item 6/19...
Running solver on item 7/19...
Running solver on item 8/19...
Running solver on item 9/19...
Running solver on item 10/19...
Running solver on item 11/19...
Running solver on item 12/19...
Running solver on item 13/19...
Running solver on item 14/19...
Running solver on item 15/19...
Running solver on item 16/19...
Running solver on item 17/19...
Running solver on item 18/19...
Running solver on item 19/19...
Checking if any pre-trained model is compatible with this input dataset...

/opt/anaconda3/envs/scRNAseq/lib/python3.6/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator PCA from version 0.22.2.post1 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
/opt/anaconda3/envs/scRNAseq/lib/python3.6/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator LogisticRegression from version 0.22.2.post1 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)

Of 24458 genes in the input file, 19107 were found in the training set of 58243 genes.
Of 24458 genes in the input file, 18496 were found in the training set of 31283 genes.
Using thresholds stored in /data/home/EBgrant/scRNA_run1/analysis/Laura/CellO/resources/trained_models/ir.10x_genes_thresholds.tsv
Binarizing classifications...
Mapping each sample to its predicted labels...
Computing the most-specific predicted labels...
Item 1 predicted to be "somatic cell (CL:0002371)"
Item 2 predicted to be "somatic cell (CL:0002371)"
Item 3 predicted to be "somatic cell (CL:0002371)"
Item 4 predicted to be "neuron associated cell (CL:0000095)"
Item 5 predicted to be "astrocyte (CL:0000127)"
Item 6 predicted to be "astrocyte (CL:0000127)"
Item 7 predicted to be "somatic cell (CL:0002371)"
Item 8 predicted to be "CNS neuron (sensu Vertebrata) (CL:0000117)"
Item 9 predicted to be "astrocyte (CL:0000127)"
Item 10 predicted to be "hepatocyte (CL:0000182)"
Item 11 predicted to be "neurecto-epithelial cell (CL:0000710)"
Item 12 predicted to be "neural cell (CL:0002319)"
Item 13 predicted to be "astrocyte (CL:0000127)"
Item 14 predicted to be "astrocyte (CL:0000127)"
Item 15 predicted to be "squamous epithelial cell (CL:0000076)"
Item 16 predicted to be "astrocyte (CL:0000127)"
Item 18 predicted to be "neuron associated cell (CL:0000095)"
Item 19 predicted to be "endothelial cell of umbilical vein (CL:0002618)"

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-28-1600827178e8> in <module>
      7     cello_resource_loc,
      8     model_file=f'{model_prefix}.model.dill',
----> 9     term_ids=True
     10 )

/opt/anaconda3/envs/scRNAseq/lib/python3.6/site-packages/cello/scanpy_cello.py in cello(adata, clust_key, rsrc_loc, algo, out_prefix, model_file, log_dir, term_ids, remove_anatomical_subterms)
    206         adata.obs['Most specific cell type'] = [
    207             ou.cell_ontology().id_to_term[c].name
--> 208             for c in ms_results_df['most_specific_cell_type']
    209         ]
    210     else:

/opt/anaconda3/envs/scRNAseq/lib/python3.6/site-packages/cello/scanpy_cello.py in <listcomp>(.0)
    206         adata.obs['Most specific cell type'] = [
    207             ou.cell_ontology().id_to_term[c].name
--> 208             for c in ms_results_df['most_specific_cell_type']
    209         ]
    210     else:

KeyError: ''

Error when Transforming with PCA...

Hello DeweyLab,

I am working with a dataset of ~22k cells and ~22k genes, and after generating a model with cello.scanpy_cello(), I keep getting an error when transforming with PCA:

Transforming with PCA...
Traceback (most recent call last):
File "/Users/name/Documents/Bio Scripts/Python/scRNAseq & RNAseq Analysis/Human/Skeletal Muscle/SkelMuscleAnalysis", line 178, in
cello.scanpy_cello(adata, clust_key='leiden', rsrc_loc=cello_resource_loc, model_file=f'{two_model_prefix}.model.dill')
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/scanpy_cello.py", line 146, in cello
results_df, finalized_binary_results_df, ms_results_df = ce.predict(
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py", line 232, in predict
results_df, cell_to_clust = _raw_probabilities(
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/cello.py", line 437, in _raw_probabilities
conf_df, score_df = mod.predict(expr, ad_clust.obs.index)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/models/model.py", line 89, in predict
X = self._preprocess(X)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/models/model.py", line 85, in _preprocess
X = prep.transform(X)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/cello/models/pca.py", line 40, in transform
X = self.model.transform(X)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/decomposition/_base.py", line 117, in transform
X = self._validate_data(X, dtype=[np.float64, np.float32], reset=False)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/base.py", line 566, in _validate_data
X = check_array(X, **check_params)
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/utils/validation.py", line 800, in check_array
_assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
File "/Users/name/opt/anaconda3/envs/p-scRNAseq/lib/python3.9/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I can generate a model so I am happy works out for me, but even when I use the preexisting model with the model_file=f'{model_prefix}.model.dill' parameter, the same error occurs.

I've tried changing the number of highly variable genes I keep(ranging from the minimum 3000 to 10000) or even changing my number of clusters by changing n_neighbors, n_pcs, and/or resolution in sc.pp.neighbors() and sc.tl.leiden(), but varying those parameters didn't fix the error either.

The package works perfectly for me otherwise, it is just at this part do I start having errors.
I haven't seen anyone else share this problem so I am hoping you know of a fix because I am at a dead end and troubleshooting by myself isn't working out so well.

Hope you're doing well,
Todd

Understanding assignment of most specific cell type

Hello and thanks for your amazing tool. I am trying it out right now, but somehow it seems to me that either I don't quite understand how the most specif cell type is called and how the binarization is done, or that, in my case, it is not doing what it should.

See the example graph below. Please Note, that I modified the graph:

  • Grey outlines mean there is no binary information for this cell
  • Red outlines mean the binary assignment for this cell is False for this cluster
  • Green outlines mean the binary assignment for this cell is True for this cluster
  • The Octagonal shape is the actual called "most specific" cell type.

cluster_0_graph_fig_1

As you can see the CD14-positive, CD16-positive monocyte with a score of 0.27 is selected. The CD14-positive, CD16-negative classical monocyte cell type is not considered, which I don't get. It has a score of 0.45 as you can also see from this table for the respective cluster:

CD14-positive monocyte (probability) CD14-positive, CD16-negative classical monocyte (probability) CD14-positive, CD16-positive monocyte (probability) CD14-positive monocyte (binary) CD14-positive, CD16-negative classical monocyte (binary) CD14-positive, CD16-positive monocyte (binary)
0.9955782079351584 0.4540293475444127 0.2678592195788922 True False True

Strangely, I expected the CD14-positive, CD16-negative classical monocyte cells to also cross the binary threshold as these are the considered thresholds from the ir.10x_genes_thresholds.tsv file:

label label_name threshold empirical_threshold precision F1-score
CL:0001054 CD14-positive monocyte 0.5 0.9421197743088574 0.8941176470588236 0.8186714542190305
CL:0002057 CD14-positive, CD16-negative classical monocyte 0.20572466450006424 0.20572466450006424 0.047619047619047616 0.0904977375565611
CL:0002397 CD14-positive, CD16-positive monocyte 0.0021930058655731683 0.0021930058655731683 0.018518518518518517 0.03619047619047619

Is this because the predecessors (like the classical monocyte) maybe don't cross their threshold? So i predecessor information taken into account here? This would make perfect sense I guess, I was just unaware of this.

One additional question: There is no direct way to set a minimal threshold for the assignment, right? I think, if I remember correctly , I saw it somewhere in the code but the parameter is not available from the exposed function directly.

PCA Error

Hello Deweylab,

Hope the start of the summer is going well. When running CellO with scanpy with a larger single cell set (170k cells) I get the following error:

Could not find compatible pre-trained model.
Found CellO resources at './resources'.
Loading ontology...
Loading expression data from ./resources/training_set/log_tpm.h5...
Loaded matrix of shape (4293, 58243)
done.
Inferred that input file uses HGNC gene symbols.
Of 2675 genes in the input file, 2508 were found in the training set of 58243 genes.
Training model...
Fitting PCA with 3000 components...

ValueError: n_components=3000 must be between 1 and min(n_samples, n_features)=2508 with svd_solver='randomized'

Is there a suggestion to reducing the size or some other method to fix this?

I get these warnings prior to the error but I am unsure if these are causing it:

/.local/lib/python3.8/site-packages/sklearn/base.py:310: UserWarning: Trying to unpickle estimator PCA from version 0.22.2.post1 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(
/.local/lib/python3.8/site-packages/sklearn/base.py:310: UserWarning: Trying to unpickle estimator LogisticRegression from version 0.22.2.post1 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
warnings.warn(

Many Thanks!
Brian

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.