dawe / schist

An interface for Nested Stochastic Block Model for single cell analysis

Home Page: https://schist.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 99.88% Shell 0.12%

clustering single-cell

schist's People

stuarteberg leomorelli giovp lillux namsaraeva

schist's Issues

draw_tree is broken

I haven't used schist.pl.draw_tree for a while and have now discovered that it is broken: cluster labels are not correctly positioned around the circle.

Deprecate functions

Some functions in tools can now be deprecated and removed. One is plug_state, as we don't need it anymore, at least not in its current form.

Hello Dawe!
I am trying to install Schist v0.8.3 on the UGent HPC to provide this software to researchers. We use EasyBuild to build, install, and provide software, so I would like to build from source.
For now I am fighting with the graph-tool dependency. I tried to install its latest version, v2.68. The installation works fine, but the check commands fail:
from graph_tool.all import graph_draw
from graph_tool.all import Graph, BlockState
import graph_tool.inference
All of these return the same error:

Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/scratch/gent/vo/001/gvo00117/easybuild/RHEL8/cascadelake-ampere-ib/software/graph-tool/2.68-foss-2023a/lib/python3.11/site-packages/graph_tool/all.py", line 34, in <module>
        from graph_tool.draw import *
      File "/scratch/gent/vo/001/gvo00117/easybuild/RHEL8/cascadelake-ampere-ib/software/graph-tool/2.68-foss-2023a/lib/python3.11/site-packages/graph_tool/draw/__init__.py", line 87, in <module>
        from .. inference import minimize_blockmodel_dl, BlockState, ModularityState
      File "/scratch/gent/vo/001/gvo00117/easybuild/RHEL8/cascadelake-ampere-ib/software/graph-tool/2.68-foss-2023a/lib/python3.11/site-packages/graph_tool/inference/__init__.py", line 331, in <module>
        from . blockmodel import *
      File "/scratch/gent/vo/001/gvo00117/easybuild/RHEL8/cascadelake-ampere-ib/software/graph-tool/2.68-foss-2023a/lib/python3.11/site-packages/graph_tool/inference/blockmodel.py", line 119, in <module>
        @entropy_state_signature
         ^^^^^^^^^^^^^^^^^^^^^^^
      File "/scratch/gent/vo/001/gvo00117/easybuild/RHEL8/cascadelake-ampere-ib/software/graph-tool/2.68-foss-2023a/lib/python3.11/site-packages/graph_tool/inference/base_states.py", line 110, in entropy_state_signature
        warn = "\n".join([" " * (m if j == 0 else m + 4) + l.lstrip() for
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/scratch/gent/vo/001/gvo00117/easybuild/RHEL8/cascadelake-ampere-ib/software/graph-tool/2.68-foss-2023a/lib/python3.11/site-packages/graph_tool/inference/base_states.py", line 110, in <listcomp>
        warn = "\n".join([" " * (m if j == 0 else m + 4) + l.lstrip() for
                          ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
        TypeError: can't multiply sequence by non-int of type 'generator'

What version of graph-tool is recommended for use with Schist v0.8.3?

Some details about the software I use:
GCC v12.3.0 + OpenMPI v4.1.5 + FlexiBLAS v3.3.1 + FFTW v3.3.10 + ScaLAPACK v2.2.0
python v3.11.3
Boost v1.83.0
Numpy v1.25.1
Scipy v1.11.1
Pandas v2.0.3
joblib v1.2.0

Add support for MuData

While schist allows the analysis of multimodal data, it does so by taking a list of multiple AnnData objects. It would be nice to have support for MuData.

Do the docs

The documentation is largely outdated; it should be rewritten (possibly from scratch).

Add parallel processing options

Since inference is performed multiple times (as set by the n_init parameter), it would be useful to add an n_jobs option to distribute independent initialisations when multiple processors are available.
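
One way this could look, as a rough sketch rather than schist's actual implementation: each initialisation gets its own seed and runs in a separate worker, and the result with the lowest score is kept. Here `fit_once` is a hypothetical stand-in for a single inference run, and the standard-library `concurrent.futures` module is used only to keep the example self-contained; joblib would play the same role.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def fit_once(seed):
    """Hypothetical stand-in for one independent initialisation.

    Returns (score, seed); in a real run the score would be the
    description length (entropy) of the fitted state.
    """
    rng = random.Random(seed)
    return rng.random(), seed


def fit_many(n_init=4, n_jobs=2, random_seed=0):
    """Run n_init independent initialisations on n_jobs workers, keep the best."""
    seeds = [random_seed + i for i in range(n_init)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = list(pool.map(fit_once, seeds))
    return min(results)  # lowest score wins


best_score, best_seed = fit_many(n_init=4, n_jobs=2)
```

Because every run is seeded independently, the outcome does not depend on n_jobs. For CPU-bound inference a process-based pool would be the natural choice; threads are used here only so the sketch runs anywhere.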

Remove `state` from unstructured data and `IO` functions

Since all annotation levels are stored, we can remove the need to keep gt.NestedBlockState in adata.uns['schist']. Once annotations are present, a new state can be easily reconstructed, given that

the graph can be built from connectivities
the block state can be built from nsbm_level entries in adata.obs

This can be done if we make sure that all parameters are stored in the appropriate dictionary.
If this is implemented, there is no need to use schist.io functions to read/write, as they exist only to dump the state into a separate pickle.
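
A rough sketch of the second step, assuming each cell carries one label per level (as in the nsbm_level entries of adata.obs) and that the levels are properly nested. The helper name `nested_partition` is made up for illustration. It turns per-cell labels into the list of per-level block assignments that graph-tool's NestedBlockState accepts through its `bs` argument; building the graph from connectivities is not shown.

```python
def nested_partition(levels):
    """Convert per-cell labels (finest level first) into nested block arrays.

    levels[k][i] is the label of cell i at level k. In the result bs,
    bs[0][i] is the block of cell i at level 0 and, for k > 0,
    bs[k][b] is the level-k block containing block b of level k-1.
    """
    # Relabel every level with contiguous integers, in order of appearance.
    coded = []
    for labels in levels:
        codes = {}
        coded.append([codes.setdefault(lab, len(codes)) for lab in labels])

    bs = [coded[0]]
    for lower, upper in zip(coded, coded[1:]):
        parent = {}
        for child, par in zip(lower, upper):
            # Nestedness check: a block must have a single parent.
            if parent.setdefault(child, par) != par:
                raise ValueError("levels are not nested")
        bs.append([parent[b] for b in range(len(parent))])
    return bs


level_0 = ["a", "a", "b", "c", "c"]
level_1 = ["x", "x", "x", "y", "y"]
bs = nested_partition([level_0, level_1])
# bs == [[0, 0, 1, 2, 2], [0, 0, 1]]
```

With graph-tool available, the result would then be passed as NestedBlockState(g, bs=bs), where g is the graph rebuilt from the connectivities.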

graph-tool patch no longer needed

The current version of graph-tool (2.33) no longer needs any patching, so maybe you should just tell users to upgrade.

Font size in plotting functions

I was wondering whether the plotting functions should take a parameter to set the font size of group labels.
In particular, with schist.plotting.alluvial(), group labels of lower levels often overlap.

Error with nested_model() options

Hi!!!
If I try to use nested_model() with the collect_marginals and equilibrate options, I get an error:

UnboundLocalError Traceback (most recent call last)
in
----> 1 nested_model(data, collect_marginals=True, equilibrate=True)

~/opt/anaconda3/envs/scanpy/lib/python3.7/site-packages/schist/inference/_nested_model.py in nested_model(adata, max_iterations, epsilon, equilibrate, wait, nbreaks, collect_marginals, niter_collect, hierarchy_length, deg_corr, multiflip, fast_model, n_init, beta_range, steps_anneal, resume, restrict_to, random_seed, key_added, adjacency, neighbors_key, directed, use_weights, prune, return_low, copy, minimize_args, equilibrate_args)
285 mcmc_equilibrate_args=equilibrate_args,
286 niter=steps_anneal,
--> 287 beta_range=beta_range)
288 if collect_marginals and equilibrate:
289 # we here only retain level_0 counts, until I can't figure out

~/opt/anaconda3/envs/scanpy/lib/python3.7/site-packages/graph_tool/inference/mcmc.py in mcmc_anneal(state, beta_range, niter, history, mcmc_equilibrate_args, verbose)
264 else:
265 S = ret[0]
--> 266 attempts += ret[1]
267 nmoves += ret[2]
268

UnboundLocalError: local variable 'attempts' referenced before assignment

I have no problem if I just use nested_model(data).

Thanks!

Remove `unknown` if `use_best` in label transfer

When scs.tl.label_transfer is called with use_best=True, the unknown label is still present in the transferred annotations, although it is never used. It would be better to remove it.

No I/O for ppbm

Support for saving information about Planted Partition Blocks is limited. If one has both an nsbm and a ppbm object, only the nsbm is pickled.

Improved graph-tool installation instructions

To install graph-tool via conda the actual commands are:

conda create --name gt -c conda-forge graph-tool
conda activate gt

The command currently given in the instructions will not work.

It might be good to inform users that they can install (without compilation) using Homebrew on macOS, and also on Ubuntu/Debian. The installation instructions are here: https://git.skewed.de/count0/graph-tool/-/wikis/installation-instructions

Alternative approach based on greedy merge-split MCMC

It may be worthwhile also adding an alternative strategy based on a simple greedy MCMC:

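import numpy
from graph_tool.inference import NestedBlockState

state = NestedBlockState(g)  # g: the graph_tool.Graph to partition
delta = 1
while abs(delta) > 1e-6:  # greedy sweeps (beta = inf) until the entropy change is negligible
    delta = state.multiflip_mcmc_sweep(niter=10, beta=numpy.inf)[0]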

This could be faster in some cases, while still providing good results.

Broken docs

After a small code refactoring in 0.7.12, the docs are broken.

KeyError

Hi, I'm trying schist with a Seurat-converted object, but I encountered the error below. Any suggestions on how to proceed?

adata

AnnData object with n_obs × n_vars = 4015 × 15309
obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'RNA_Condition', 'percent.mt', 'S.Score', 'G2M.Score', 'Phase', 'old.ident', 'CC.Difference', 'RNA_snn_h.orig.ident_res.0.6', 'seurat_clusters', 'RNA_snn_f.orig.ident_res.0.6', 'RNA_snn_m.orig.ident_res.0.6', 'RNA_snn_h.orig.ident_res.1.2', 'RNA_snn_f.orig.ident_res.1.2', 'RNA_snn_m.orig.ident_res.1.2', 'RNA_snn_h.orig.ident_res.1.8', 'RNA_snn_f.orig.ident_res.1.8', 'RNA_snn_m.orig.ident_res.1.8', 'SingleR_BlueprintEncodeData_labels', 'SingleRrefined_BlueprintEncodeData_labels', 'SingleR_HumanPrimaryCellAtlasData_labels', 'SingleRrefined_HumanPrimaryCellAtlasData_labels', 'SingleR_MonacoImmuneData_labels', 'SingleRrefined_MonacoImmuneData_labels', 'SingleR_DatabaseImmuneCellExpressionData_labels', 'SingleRrefined_DatabaseImmuneCellExpressionData_labels', 'SingleR_NovershternHematopoieticData_labels', 'SingleRrefined_NovershternHematopoieticData_labels'
var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
uns: 'neighbors'
obsm: 'X_fastmnn.orig.ident', 'X_harmony.orig.ident', 'X_pca', 'X_tsne.fastmnn.orig.ident', 'X_tsne.harmony.orig.ident', 'X_tsne.merge.orig.ident', 'X_umap.fastmnn.orig.ident', 'X_umap.harmony.orig.ident', 'X_umap.merge.orig.ident'
varm: 'FASTMNN.ORIG.IDENT', 'HARMONY.ORIG.IDENT', 'PCs'
obsp: 'distances'

scs.inference.nested_model(adata)

/DATA_NFS/anaconda3/envs/postsc/lib/python3.8/site-packages/schist/inference/_nested_model.py:145: FutureWarning: This location for 'connectivities' is deprecated. It has been moved to .obsp[connectivities], and will not be accesible here in a future version of anndata.
adjacency = adata.uns[neighbors_key]['connectivities']

KeyError Traceback (most recent call last)
Cell In[7], line 1
----> 1 scs.inference.nested_model(a)

File /DATA_NFS/anaconda3/envs/postsc/lib/python3.8/site-packages/schist/inference/_nested_model.py:145, in nested_model(adata, deg_corr, tolerance, n_sweep, beta, samples, collect_marginals, n_jobs, restrict_to, random_seed, key_added, adjacency, neighbors_key, directed, use_weights, save_model, copy, dispatch_backend)
142 adjacency = adata.obsp[conn_key]
143 else:
144 # scanpy<=1.4.6 has sparse matrix here
--> 145 adjacency = adata.uns[neighbors_key]['connectivities']
146 if restrict_to is not None:
147 restrict_key, restrict_categories = restrict_to

File /DATA_NFS/anaconda3/envs/postsc/lib/python3.8/site-packages/anndata/compat/_overloaded_dict.py:98, in OverloadedDict.getitem(self, key)
96 def getitem(self, key):
97 if key in self.overloaded:
---> 98 return self.overloaded[key].get()
99 else:
100 return self.data[key]

File /DATA_NFS/anaconda3/envs/postsc/lib/python3.8/site-packages/anndata/compat/_overloaded_dict.py:160, in _adjacency_getter(ovld, key, adata)
154 """For overloading:
155
156 >>> mtx = adata.uns["neighbors"]["connectivities"] # doctest: +SKIP
157 >>> mtx = adata.uns["neighbors"]["distances"] # doctest: +SKIP
158 """
159 _access_warn(key, f".obsp[{key}]")
--> 160 return adata.obsp[key]

File /DATA_NFS/anaconda3/envs/postsc/lib/python3.8/site-packages/anndata/_core/aligned_mapping.py:148, in AlignedActualMixin.getitem(self, key)
147 def getitem(self, key: str) -> V:
--> 148 return self._data[key]

KeyError: 'connectivities'
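
Judging from the printout above, the converted object has only 'distances' in obsp, while the traceback ends in a lookup of 'connectivities'. A small pre-flight check along these lines would make the cause explicit before inference starts. This is only a sketch: `check_neighbors` is a hypothetical helper, and the `SimpleNamespace` object merely mimics the `obsp` mapping of an AnnData.

```python
from types import SimpleNamespace


def check_neighbors(adata, conn_key="connectivities"):
    """Raise a descriptive error if the neighbour graph is missing."""
    if conn_key not in adata.obsp:
        raise KeyError(
            f"{conn_key!r} not found in adata.obsp (available: "
            f"{sorted(adata.obsp)}); compute the neighbour graph first, "
            "for example with scanpy.pp.neighbors(adata)"
        )
    return adata.obsp[conn_key]


# Mimic the converted object from the report: distances only.
converted = SimpleNamespace(obsp={"distances": "sparse matrix placeholder"})
try:
    check_neighbors(converted)
except KeyError as err:
    message = str(err)
```

With scanpy installed, running scanpy.pp.neighbors on the object (for instance with use_rep set to one of the existing embeddings) writes both 'distances' and 'connectivities' to obsp.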

RuntimeError when using schist.tl.select_affinity() with a level specified

If I try to specify which level I want to select with schist.tl.select_affinity(), a RuntimeError is raised:
schist.tl.select_affinity(adata,level=2)
ERROR: Level 2 was not found in your data

RuntimeError Traceback (most recent call last)
in
----> 1 schist.tl.select_affinity(adata,level=2)

~/anaconda3/envs/SCRNA/lib/python3.8/site-packages/schist/tools/_select.py in select_affinity(adata, level, threshold, inverse, key, update_state, filter, copy)
54 if level not in adata.uns[key]['cell_affinity']:
55 logg.error(f'Level {level} was not found in your data')
---> 56 raise
57
58 affinities = adata.uns[key]['cell_affinity'][level]

RuntimeError: No active exception to reraise
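
The final RuntimeError comes from the bare `raise` on line 56 of the traceback: outside an `except` block there is no exception to re-raise, so Python reports that instead of the intended message. A minimal sketch of the difference, with a plain dictionary standing in for adata.uns[key]['cell_affinity'] and KeyError chosen here as one reasonable exception type:

```python
def select_bare(affinity, level):
    """Mirrors the reported behaviour: bare raise with no active exception."""
    if level not in affinity:
        raise  # RuntimeError: No active exception to reraise
    return affinity[level]


def select_explicit(affinity, level):
    """Raises an exception that names the actual problem."""
    if level not in affinity:
        raise KeyError(f"Level {level} was not found in your data")
    return affinity[level]


cell_affinity = {0: [0.9, 0.1], 1: [0.6, 0.4]}  # toy data

for func in (select_bare, select_explicit):
    try:
        func(cell_affinity, 2)
    except Exception as err:
        print(f"{func.__name__}: {type(err).__name__}")
```

The first call reports a RuntimeError and the second a KeyError carrying the message that was previously only logged.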

Label transfer crashes when too many cells are there

Apparently label transfer, specifically the step where affinities are calculated, can crash with a memory error when there are too many cells, and "many" is not even that high (>25k).
This is probably because schist tries to allocate a numpy array that is too big to handle. A sparse implementation could be the way to go, but I'm afraid the affinity matrix is not really sparse, unless we set a threshold below which every value is treated as 0.

scs.tl.label_transfer error

Hello!
I get this error when I apply the scs.tl.label_transfer function (and I made sure the annotation is categorical):

AttributeError Traceback (most recent call last)
in
----> 1 scs.tl.label_transfer(rna_p, rna_u, obs='leiden_rna_u')

/beegfs/scratch/ric.cosr/giansanti.valentina/.conda/envs/dnn_cnv2/lib/python3.9/site-packages/schist/tools/_affinity_tools.py in label_transfer(adata, adata_ref, obs, label_unk, use_best, neighbors_key, adjacency, directed, use_weights, pca_args, use_rep, harmony_args, copy)
461 batch_key='_label_transfer')
462 #
--> 463 adata_merge.obs[obs] = adata_merge.obs[obs].cat.add_categories(label_unk).fillna(label_unk)
464
465 # perform integration using harmony

/beegfs/scratch/ric.cosr/giansanti.valentina/.conda/envs/dnn_cnv2/lib/python3.9/site-packages/pandas/core/generic.py in getattr(self, name)
5459 or name in self._accessors
5460 ):
-> 5461 return object.getattribute(self, name)
5462 else:
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):

/beegfs/scratch/ric.cosr/giansanti.valentina/.conda/envs/dnn_cnv2/lib/python3.9/site-packages/pandas/core/accessor.py in get(self, obj, cls)
178 # we're accessing the attribute of the class, i.e., Dataset.geo
179 return self._accessor
--> 180 accessor_obj = self._accessor(obj)
181 # Replace the property with the accessor object. Inspired by:
182 # https://www.pydanny.com/cached-property.html

/beegfs/scratch/ric.cosr/giansanti.valentina/.conda/envs/dnn_cnv2/lib/python3.9/site-packages/pandas/core/arrays/categorical.py in init(self, data)
2455
2456 def init(self, data):
-> 2457 self._validate(data)
2458 self._parent = data.values
2459 self._index = data.index

/beegfs/scratch/ric.cosr/giansanti.valentina/.conda/envs/dnn_cnv2/lib/python3.9/site-packages/pandas/core/arrays/categorical.py in _validate(data)
2464 def _validate(data):
2465 if not is_categorical_dtype(data.dtype):
-> 2466 raise AttributeError("Can only use .cat accessor with a 'category' dtype")
2467
2468 def _delegate_property_get(self, name):

AttributeError: Can only use .cat accessor with a 'category' dtype
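
The failing line assumes the column is still categorical once the two objects have been merged. One defensive pattern, shown here on a toy pandas Series rather than on schist's actual merged object, is to cast back to category before using the .cat accessor. The helper name `fill_unknown` is hypothetical.

```python
import pandas as pd


def fill_unknown(labels, label_unk="unknown"):
    """Return labels as a categorical Series with missing values set to label_unk."""
    if not isinstance(labels.dtype, pd.CategoricalDtype):
        # A column that lost its dtype (e.g. became object) is cast back.
        labels = labels.astype("category")
    if label_unk not in labels.cat.categories:
        labels = labels.cat.add_categories([label_unk])
    return labels.fillna(label_unk)


# Object-dtype column with a missing entry, standing in for a merged column.
merged = pd.Series(["T cell", None, "B cell"], dtype=object)
filled = fill_unknown(merged)
```

The same function leaves an already categorical column untouched apart from adding the extra category and filling the gaps.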

Fix draw_tree

There are some issues with draw_tree, namely:

it doesn't take into account continuous values outside the [0, 1] range (could be fixed by rescaling the data)
it raises some warnings due to recent updates in matplotlib

MatplotlibDeprecationWarning: 
The modification of the Axes.artists property was deprecated in Matplotlib 3.5 and will be removed two minor releases later. Use Axes.add_artist instead.
  self.insert(len(self), value)

it doesn't plot in notebooks, again due to changes in matplotlib
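
For the first point, a min-max rescaling is one simple option. The sketch below is plain Python and independent of schist's plotting code; the function name is made up for illustration.

```python
def rescale_unit(values):
    """Linearly map values onto [0, 1]; constant input maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = rescale_unit([-2.0, 0.0, 6.0])
# scaled == [0.0, 0.25, 1.0]
```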

AttributeError: module 'schist' has no attribute 'inference'

I am following the single-cell best practices tutorial. When I ran this code:

import schist as scs

scs.inference.nested_model(adata, samples = 100, random_seed = 5678)

It raised this error.

I am using schist version 0.7.16+2.gb762b76

Thank you so much for assisting me!

draw_tree error, adata.uns['schist']['state'] is not assigned

schist runs successfully, including the scs.inference.nested_model(adata) function. However, adata.uns['schist']['state'] is not assigned as a result, and scs.plotting.draw_tree(adata) gives the error KeyError: 'state'.
Going back to the code, I can see that nested_model lists adata.uns['schist']['state'] as an output; however, it is never assigned later in the file.

I also saw a previous issue, "Remove state from unstructured data and IO functions #12", where 'state' is removed from adata.uns['schist']. A relevant bug is fixed in version 0.7.11, which I am using right now. The problem is that pl.draw_tree() still requires the 'state' (please see line 83).

I am not sure how to proceed or whether I am totally missing the point. Could the developers confirm that the package is running well for them?

graph-tool v2.57 breaks results

I recently updated graph-tool from v2.55 to v2.57 and noticed weird results when using scs.inference.nested_model. I'm tracking down the issue, but until then the last version that works with schist is v2.55. I am going to release a patch for the conda installation ASAP.

Performance degradation after updating graph-tool

I observed a general degradation in performance and longer runtimes after upgrading graph-tool to version 2.40.
The latest graph-tool version introduced some optimizations (possibly OpenMP-based) which I believe conflict with the joblib parallelization we use to run multiple models at the same time.
A possible workaround could be to downgrade to graph-tool version 2.37 or to set n_jobs=1.
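
Another thing that might be worth trying, offered here as an assumption rather than a tested fix, is to cap the number of OpenMP threads per process so that they do not oversubscribe the cores already used by joblib workers. The OMP_NUM_THREADS variable is read by OpenMP runtimes at startup, so it has to be set before graph-tool is imported.

```python
import os

# Must run before importing graph_tool (or anything else that loads OpenMP).
os.environ["OMP_NUM_THREADS"] = "1"

# Assumption: with graph-tool installed, the same cap can also be applied
# at runtime:
#     import graph_tool
#     graph_tool.openmp_set_num_threads(1)

omp_threads = os.environ["OMP_NUM_THREADS"]
```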