parashardhapola / scarf Goto Github PK

Toolkit for highly memory efficient analysis of single-cell RNA-Seq, scATAC-Seq and CITE-Seq data. Analyze atlas scale datasets with millions of cells on laptop.

Home Page: http://scarf.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 99.66% Dockerfile 0.34%

bioinformatics single-cell-genomics single-cell-rna-seq dimension-reduction clustering umap tsne big-data cite-seq single-cell-atac-seq

scarf's People

Contributors

Stargazers

Watchers

Forkers

gitter-badger razofz johanrodhe bioarpit1 stela2502 ahmedarslan crsky1023 remy-ai nygenanalytics oskbor hi-ilkin

scarf's Issues

Aggregated TODO list

Using this issue to track all the feature requests. I have tried to loosely categorize them as well

New functionality

Data 'bulkification': The idea here is to merge the data from cells within the same clusters. This can then have two downstream usages: 1) bulk data can be exported out for use in external software like GSEA and 2) mapping speed up mapping only to bulked values. (Solved here: aad885a)

Improved support

Extend meld_assay to support enhancers and TF motifs

Visualization

Add Beadplot
Add hullplots

Support for formats and dataset management

Anndata to Zarr convertor (Solved here: d41c3df and 04f20e2)
Zarr to Anndata object
Merge two or more Zarr files

If the assay is not RNA - e.g. spliced from a kallisto/bustools run

If I load a loom file from a kallisto/bustools run using

reader = scarf.readers.LoomReader( ifile, cell_names_key='CellID', feature_names_key="Gene")
writer = scarf.LoomToZarr(
    reader,
    zarr_fn='scarf_datasets/data/data.zarr',
    chunk_size=(2000, 1000),
    assay_name = "spliced"
)
writer.dump(batch_size=1000)
ds = scarf.DataStore(
    'scarf_datasets/data/data.zarr',
    nthreads=4,
    min_features_per_cell=100
)

This step breaks:

ds.mark_hvgs(
    min_cells=20,
    top_n=500,
    min_mean=-3,
    max_mean=2,
    max_var=6,
    from_assay='spliced'
)

Mark the from_assay='spliced' - but then I also need this:

ds._get_assay('spliced').__class__ = scarf.assay.RNAassay

As this assay was a scarf.assay.Assay class when loaded.

API documentation on readthedocs

Thought about taking this in the latest pull request, but I think this is a more general discussion (and the pull request is pretty crowded as it is).

There are two points to discuss here:

Currently, we provide descriptions in both the api.rst file, as well as in the docstrings, for the classes included in the API docs page. I. e., redundant. I would say that if we don't have anything extra we want to write in the api page, we should remove the descriptions from the rst file and let the docstrings be automatically included. Then the same descriptions can be found both on the docs API page and through Python help/docs functions. What do you think?
In the API docs page, we currently show only a couple of classes (e. g. Assay, DataStore). We do this with the sphinx autoclass directive. There exists a directive automodule, where everything from a module will be included in the API page, including the classes already there. Do we want to showcase only a select part of the classes, or should we take a more comprehensive approache and include the full module documentations?

Aimed at @parashardhapola 🙂

basic_tutorial.md - no normalization step?

Is scarf internally normalizing the data without external influence or am I simply missing the step in the tutorial?

Proximity of clusters in binary tree graph

In the basic_tutoral notebook:

"The tree is free form (i.e the position of clusters doesn't convey any meaning) but allows inspection of cluster similarity based on branching pattern. The sizes of clusters indicate the number of cells present in each cluster. The tree starts from the root node (black dot with no incoming edges). As an example, one can observe by looking at the branching pattern that cluster 1 is closer to cluster 8 than it is to cluster 4 since 1 and 8 share parent node whereas 4 is part of another branch. Cluster 4 is in turn closest to cluster 17."

Is what is in bold above correct?

Function "calc_mean_rank" failed type inference

If the group_key parameter of find_markers_by_rank refers to a non-numeric column in cell metadata then the Numba compilation of calc_mean_rank fails

Unsure what this error message is trying to tell me

Trying to follow the vignette as described here (https://github.com/parashardhapola/scarf/issues/new?assignees=&labels=&template=bug_report.md&title=) for scrna-seq analysis. The error message is a little ambiguous and my best guess is either an issue with loading certain libraries or incompatible dependancies.

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
_ctypes/callbacks.c in 'calling callback function'()

/opt/conda/lib/python3.7/site-packages/threadpoolctl.py in match_library_callback(info, size, data)
    546 
    547                 # Store the library controller if it is supported and selected
--> 548                 self._make_controller_from_path(filepath)
    549             return 0
    550 

/opt/conda/lib/python3.7/site-packages/threadpoolctl.py in _make_controller_from_path(self, filepath)
    671                 prefix=prefix,
    672                 user_api=user_api,
--> 673                 internal_api=internal_api,
    674             )
    675             self.lib_controllers.append(lib_controller)

/opt/conda/lib/python3.7/site-packages/threadpoolctl.py in __init__(self, **kwargs)
    784 
    785     def __init__(self, **kwargs):
--> 786         super().__init__(**kwargs)
    787         self.threading_layer = self._get_threading_layer()
    788         self.architecture = self._get_architecture()

/opt/conda/lib/python3.7/site-packages/threadpoolctl.py in __init__(self, filepath, prefix, user_api, internal_api)
    752         self.prefix = prefix
    753         self.filepath = filepath
--> 754         self._dynlib = ctypes.CDLL(filepath, mode=_RTLD_NOLOAD)
    755         self.version = self.get_version()
    756 

/opt/conda/lib/python3.7/ctypes/__init__.py in __init__(self, name, mode, handle, use_errno, use_last_error)
    362 
    363         if handle is None:
--> 364             self._handle = _dlopen(self._name, mode)
    365         else:
    366             self._handle = handle

OSError: dlopen() error

To Reproduce

ds.make_graph(
    feat_key='hvgs',
    k=11,
    dims=15,
    n_centroids=100
)

Scarf and Python version

Python 3.7.12
Scarf 0.26.3
Numpy 1.20.0
Scipy 1.7.3

can I convert and import Seurat/SCE object to scarf

Dear developers of scarf, is there a way to import Seurat or singleCellExperiment object into scarf? Thanks!

Not all dependencies included in Docker image

Going through the basic_tutorial notebook. at the plot_marker_heatmap cell I got these errors:

ds.plot_marker_heatmap(group_key='RNA_cluster', topn=3)

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
~/miniconda3/lib/python3.8/site-packages/dask/dataframe/__init__.py in <module>
     12     from .groupby import Aggregation
---> 13     from .io import (
     14         from_array,

~/miniconda3/lib/python3.8/site-packages/dask/dataframe/io/__init__.py in <module>
     12 )
---> 13 from .csv import read_csv, to_csv, read_table, read_fwf
     14 from .hdf import read_hdf, to_hdf

~/miniconda3/lib/python3.8/site-packages/dask/dataframe/io/csv.py in <module>
     23 # this import checks for the importability of fsspec
---> 24 from ...bytes import read_bytes, open_file, open_files
     25 from ..core import new_dd_object

~/miniconda3/lib/python3.8/site-packages/dask/bytes/__init__.py in <module>
      8 if fsspec is None or LooseVersion(fsspec.__version__) < LooseVersion("0.3.3"):
----> 9     raise ImportError(
     10         "fsspec is required to use any file-system functionality."

ImportError: fsspec is required to use any file-system functionality. Please install using
conda install -c conda-forge 'fsspec>=0.3.3'
or
python -m pip install 'fsspec>=0.3.3'

The above exception was the direct cause of the following exception:

ImportError                               Traceback (most recent call last)
<ipython-input-40-03ef8e60d5a0> in <module>
----> 1 ds.plot_marker_heatmap(group_key='RNA_cluster', topn=3)

/workspace/scarf/datastore.py in plot_marker_heatmap(self, from_assay, group_key, subset_key, topn, log_transform, vmin, vmax, **heatmap_kwargs)
   2287         normed_data = assay.normed(cell_idx=cell_idx, feat_idx=feat_idx[feat_argsort], log_transform=log_transform)
   2288         nc = normed_data.chunks[0]
-> 2289         normed_data = normed_data.to_dask_dataframe()
   2290         groups = daskarr.from_array(assay.cells.fetch(group_key, subset_key), chunks=nc).to_dask_dataframe()
   2291         df = controlled_compute(normed_data.groupby(groups).mean(), 4)

~/miniconda3/lib/python3.8/site-packages/dask/array/core.py in to_dask_dataframe(self, columns, index, meta)
   1502         dask.dataframe.from_dask_array
   1503         """
-> 1504         from ..dataframe import from_dask_array
   1505 
   1506         return from_dask_array(self, columns=columns, index=index, meta=meta)

~/miniconda3/lib/python3.8/site-packages/dask/dataframe/__init__.py in <module>
     55         '  python -m pip install "dask[dataframe]" --upgrade  # or python -m pip install'
     56     )
---> 57     raise ImportError(msg) from e

ImportError: Dask dataframe requirements are not installed.

Please either conda or pip install as follows:

  conda install dask                     # either conda install
  python -m pip install "dask[dataframe]" --upgrade  # or python -m pip install

I'm using a Docker container built from the Dockerfile included in the repo. It should work out of the box with that.
Will see if I can fix it, just recording the issue when I encountered so it doesn't get lost

The CrDirReader is strange

Why do you look for the gzipped features.tsv.gz but the unzipped genes.tsv?
Should you not look for features.tsv.gz and genes.tsv.gz as well as genes.tsv.gz and genes.tsv?

if os.path.isfile(self.loc + 'features.tsv.gz'):
self.matFn = self.loc + 'matrix.mtx.gz'
grps = {'feature_ids': ('features.tsv.gz', 0),
'feature_names': ('features.tsv.gz', 1),
'feature_types': ('features.tsv.gz', 2),
'cell_names': ('barcodes.tsv.gz', 0)}
elif os.path.isfile(self.loc + 'genes.tsv'):
self.matFn = self.loc + 'matrix.mtx'
grps = {'feature_ids': ('genes.tsv', 0),
'feature_names': ('genes.tsv', 1),
'feature_types': None,
'cell_names': ('barcodes.tsv', 0)}
else:
raise IOError("ERROR: Couldn't find files")

Support for Trimodal multiomic data analysis

Thanks for developing wonderful single cell analysis pipeline. Use case examples are great to start hands on with analysis pipeline is it possible to get a similar use case for either TEA seq or DOGMA seq dataset

Expression data gets plotted as grouping data

When the expression is not normalized (resampled and 'fake norm') the then low complexity integer values do not get the 'expression' color schemata, but the 'grouping' color.
It would be great to still get the same colors schemata even for low complexity genes.
I had the same problem with the nFeature if using a resamples data set. It would be great to force whether to use a continuous color or a discrete color for any plot.

retain QC filtering and columns after scarf.make.subset()

Hi,

In regards to my former issue: "Retain QC filtering after scarf.ZarrMerge()?"
#28

I realized that I lose the QC filtering already at the make_subset, the step before ZarrMerge . So a similar fix to retain the QC filtering and columns after scarf.make_subset is also needed. Possible to add?

Anndata's H5ad read/write support

Make this function more tolerant to different datasets:

class H5adReader:
    def __init__(self, h5ad_fn, cname=Null, gname=Null ):

if ( cname == Null ):
   use h5ad.obs.index
else:
  use h5ad.obs[cname]

...

Merge / project more samples onto one sample.

Yes just that one ;-)

Retain QC filtering after scarf.ZarrMerge()?

Hi,

I am want to merge a set of zarr files in multiple different combinations with scarf.ZarrMerge(). But after merge the QC filtering in "I" column is reset and lost. Is there a way to retain this data during the merge so I don't have to filter all the different file combinations individually after the merge?

thanks,

Rebecca

Computing kNN graphs on user-provided dimensional reductions?

Hi! This is fantastic work, congratulations!

I'm, however, saddened by the somewhat hard-coded kNN graph computation, based solely on PCA, which can be misleading if data does not necessarily lie in a series of linear subspaces. A huge deal of work has been done lately on dimensional reduction, and thus restricting the kNN graphs to PCA is an important limitation. Is there any way to compute the kNN graphs using a user-provided dimensionality reduced basis (i.e., ICA, CCA, DM, etc.)? In Scanpy and Seurat, users can specify the basis they would like to use.

Alternatively, how can users add externally computed kNN graphs and dimensional reductions to Scarf objects?

I've only got started with Scarf, so any directions on which parts of the code should be adapted to make this an option to the end-user would be fantastic.

Thanks!

Documentation enhancement for SubsetZarr

Recently an issue was solved where scarf.writers.SubsetZarr did not produce the expected outfiles.
Now I hit the same problem following the info on this web page: https://scarf.readthedocs.io/en/latest/api.html

I would greatly appreciate if a short example could be added to the api documentation. Like:

writer = scarf.writers.SubsetZarr( in_zarr = "full.zarr", out_zarr = "subset.zarr", cell_key="yourBinaryKey", overwrite_existing_file=True)
writer.dump() # to write the data to disk

a problem import scarf (wrong version of tqdm)

Dear scarf team, I installed scarf using pip successfully, but I encounter the error message below when I tried to import scarf.
Weirdly I already installed tqdm by pip install tqdm, and it says Requirement already satisfied: tqdm in ./opt/anaconda3/lib/python3.8/site-packages (4.50.2). Do you by any chance know what might cause this error message? Thank you very much!

`---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
in
1 import pandas as pd
----> 2 import scarf

~/opt/anaconda3/lib/python3.8/site-packages/scarf/init.py in
45 print("Scarf is not installed", flush=True)
46
---> 47 from .datastore import *
48 from .readers import *
49 from .writers import *

~/opt/anaconda3/lib/python3.8/site-packages/scarf/datastore.py in
13 import dask.array as daskarr
14 from scipy.sparse import csr_matrix, coo_matrix
---> 15 from .writers import create_zarr_dataset, create_zarr_obj_array
16 from .metadata import MetaData
17 from .assay import Assay, RNAassay, ATACassay, ADTassay

~/opt/anaconda3/lib/python3.8/site-packages/scarf/writers.py in
23 from typing import Any, Tuple, List, Union, Dict
24 import numpy as np
---> 25 from .readers import CrReader, H5adReader, NaboH5Reader, LoomReader
26 import os
27 import pandas as pd

~/opt/anaconda3/lib/python3.8/site-packages/scarf/readers.py in
20 from typing import IO
21 import h5py
---> 22 from .utils import logger, tqdmbar
23
24 all = [

~/opt/anaconda3/lib/python3.8/site-packages/scarf/utils.py in
14 import sys
15 import numpy as np
---> 16 from tqdm.dask import TqdmCallback
17 from dask.array.core import Array
18 from tqdm.auto import tqdm as std_tqdm

ModuleNotFoundError: No module named 'tqdm.dask'`

vignette step fails due to missing SG-tSNE libs

I noticed that when building the docs page, cell 20 and 21 of the basic workflow vignette failed:

ERROR: SG-tSNE failed, possibly due to missing libraries or file permissions. SG-tSNE currently fails on readthedocs

And it also fail on readthedocs. Judging from the output error, you're probably already aware of this :) It works if one installs the current version from PyPI (0.7.7), but not when installing from the current source code (and on readthedocs, as seen above).

I saw some commit where you removed the sgtsne libs from the repo, do you have a plan to handle this in another way or should we discuss possibilities?

Additonal feature request to ds.plot_layout()

I am missing the following features in the ds.plot_layout, If you could add any of them it would be much appreciated:

An option to give new axis label
So I can hide labels given at : ds.make_graph(label = "a")) and keep RNA_UMAP1/RNA_UMAP2
The option to put your on title ontop of the umap.
So you for ex. can put in you sample name. Currently it is automatically states the selected parameter for color_by in ds.plot_layout()
After projections onto another a data set, to be able to sort cells displayed in the umap based on projections values. So the smaller dots ends up in the bottom layer of the umap.

Thanks!

h5py - minimal requirement

h5ps in the 3.2.x versions does not support to write dtype(0) objects to h5 files:

TypeError: Object dtype dtype('O') has no native HDF5 equivalent

the default print function

It would be helpful if the default print function would be more phony.
Which cell col names? Which RNA.feats col names?
Also which layout_key's are there in the data?

It might also help to improve the error message in plot_layout - if the UMAP1 is not found - what is there and possible - where did you check for that?

Datastore, sgtsne not working in Docker container

Similar to #16, it seems the sgtsne command is not working. As in the referenced issue, I am using the provided Docker container.

ds.run_tsne(alpha=20, box_h=1)

Saving KNN matrix in MTX format: 100%|██████████| 7648/7648 [00:02<00:00, 2884.05it/s]

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-36-d9c983b4281c> in <module>
----> 1 ds.run_tsne(alpha=20, box_h=1)

/workspace/scarf/datastore.py in run_tsne(self, from_assay, cell_key, feat_key, min_edge_weight, symmetric_graph, graph_upper_only, ini_embed, tsne_dims, lambda_scale, max_iter, early_iter, alpha, box_h, temp_file_loc, label, verbose)
    995               f" -h {box_h} -i {ini_emb_fn} -o {out_fn} {knn_mtx_fn}"
    996         if verbose:
--> 997             system_call(cmd)
    998         else:
    999             os.system(cmd)

/workspace/scarf/utils.py in system_call(command)
     89     import shlex
     90 
---> 91     process = subprocess.Popen(shlex.split(command), stdout=subprocess.PIPE)
     92     while True:
     93         output = process.stdout.readline()

~/miniconda3/lib/python3.8/subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
    852                             encoding=encoding, errors=errors)
    853 
--> 854             self._execute_child(args, executable, preexec_fn, close_fds,
    855                                 pass_fds, cwd, env,
    856                                 startupinfo, creationflags, shell,

~/miniconda3/lib/python3.8/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
   1700                     if errno_num != 0:
   1701                         err_msg = os.strerror(errno_num)
-> 1702                     raise child_exception_type(errno_num, err_msg, err_filename)
   1703                 raise child_exception_type(err_msg)
   1704 

FileNotFoundError: [Errno 2] No such file or directory: 'sgtsne'

More complete extract from datastore.py:

        # [..]
        cmd = f"sgtsne -m {max_iter} -l {lambda_scale} -d {tsne_dims} -e {early_iter} -p 1 -a {alpha}" \
              f" -h {box_h} -i {ini_emb_fn} -o {out_fn} {knn_mtx_fn}"
        if verbose:
            system_call(cmd)
        # [..]

ds.RNA.mark_hvgs(min_cells=100, top_n=2000)

I get the error:
IndexError: index 2000 is out of bounds for axis 0 with size 14

change to
ds.RNA.mark_hvgs(min_cells=100, top_n=14)
I get the error:
IndexError: index 14 is out of bounds for axis 0 with size 14

there is a check missing!

Default t-SNE parameters

Great paper and great package. Amazing work!

I am specifically interested in your UMAP/t-SNE comparisons and benchmarks and am now trying to figure out the SG-t-SNE-Pi default parameters that you use. As far as I understood your API, your defaults are max_iter=500, early_iter=200, alpha=10 where alpha denotes early exaggeration coefficient. I noticed that 10 and 200 are slightly different from the default values in most existing t-SNE implementations (12 and 250), I wonder why. But they are pretty close so it does not really matter. What is not mentioned though is the learning rate. Learning rate can have a huge influence on t-SNE embeddings and speed of convergence. See https://www.nature.com/articles/s41467-019-13056-x and https://www.nature.com/articles/s41467-019-13055-y that recommend setting learning rate to n/12 where n is the sample size. What learning rate is used by the SG-t-SNE-Pi implementation that you use?

Unrelated, if I understood the "SG" part and your implementation correctly, you construct a kNN graph using k=10, then assign UMAP weights to the edges, and then when running t-SNE, SG-t-SNE will normalize each row of the affinity matrix to sum to 1. Then symmetrize and run t-SNE as usual. Right? If I understood correctly, then this is pretty much exactly how it should be implemented in Scanpy soon, see pending scverse/scanpy#1561 by @pavlin-policar. Nice.

Finally, I am not entirely sure I understood your initialization approach. It's great that you use the same initialization for t-SNE and UMAP (another relevant paper here: https://www.nature.com/articles/s41587-020-00809-z). But I am confused by the following bit:

PCA is performed on the Kmeans clusters centroid matrix which of form n × c, where n is the number of cells and c is the number of centroids.

Is this a binary matrix that has 1 in position ij if cell i belongs to cluster j? If so, I'm not quite sure what's the point of running PCA on such a matrix? I'm probably misunderstanding.

pip installation fails with this ERROR: Could not install packages due to an OSError: [Errno 30] Read-only file system: '/opt/python/3.9.14/lib/python3.9/site-packages/texttable.py'

Collecting scarf
Downloading scarf-0.26.3-py3-none-any.whl (200 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.3/200.3 KB 3.7 MB/s eta 0:00:00a 0:00:01
Collecting hnswlib
Downloading hnswlib-0.7.0.tar.gz (33 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: importlib-metadata in /opt/python/3.9.14/lib/python3.9/site-packages (from scarf) (6.5.0)
Collecting scikit-network
Downloading scikit_network-0.30.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 54.6 MB/s eta 0:00:0000:0100:01
Collecting scikit-learn
Using cached scikit_learn-1.2.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
Collecting umap-learn
Using cached umap-learn-0.5.3.tar.gz (88 kB)
Preparing metadata (setup.py) ... done
Collecting dask>=2021.8.1
Downloading dask-2023.4.1-py3-none-any.whl (1.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 42.3 MB/s eta 0:00:00
Collecting ipython-autotime
Downloading ipython_autotime-0.3.1-py2.py3-none-any.whl (6.8 kB)
Collecting pybind11
Using cached pybind11-2.10.4-py3-none-any.whl (222 kB)
Collecting gensim
Downloading gensim-4.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26.5/26.5 MB 52.5 MB/s eta 0:00:0000:0100:01
Collecting loguru
Downloading loguru-0.7.0-py3-none-any.whl (59 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.0/60.0 KB 3.4 MB/s eta 0:00:00
Collecting pandas
Downloading pandas-2.0.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.4/12.4 MB 69.7 MB/s eta 0:00:0000:0100:01
Collecting numba
Using cached numba-0.56.4-cp39-cp39-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.5 MB)
Collecting leidenalg
Downloading leidenalg-0.9.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 46.0 MB/s eta 0:00:00
Requirement already satisfied: ipywidgets in /opt/python/3.9.14/lib/python3.9/site-packages (from scarf) (8.0.6)
Collecting numpy
Using cached numpy-1.24.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting datashader
Downloading datashader-0.14.4-py2.py3-none-any.whl (18.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 59.2 MB/s eta 0:00:0000:0100:01
Collecting numcodecs
Downloading numcodecs-0.11.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 MB 75.7 MB/s eta 0:00:00:00:0100:01
Collecting joblib
Using cached joblib-1.2.0-py3-none-any.whl (297 kB)
Collecting tqdm
Using cached tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting h5py>=3.3.0
Using cached h5py-3.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB)
Collecting networkx
Using cached networkx-3.1-py3-none-any.whl (2.1 MB)
Requirement already satisfied: setuptools in /opt/python/3.9.14/lib/python3.9/site-packages (from scarf) (67.6.1)
Collecting seaborn
Using cached seaborn-0.12.2-py3-none-any.whl (293 kB)
Collecting statsmodels
Using cached statsmodels-0.13.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.9 MB)
Collecting kneed
Downloading kneed-0.8.3-py3-none-any.whl (10 kB)
Requirement already satisfied: requests in /opt/python/3.9.14/lib/python3.9/site-packages (from scarf) (2.28.2)
Collecting matplotlib
Using cached matplotlib-3.7.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.6 MB)
Collecting cmocean
Downloading cmocean-3.0.3-py3-none-any.whl (222 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 222.1/222.1 KB 15.3 MB/s eta 0:00:00
Collecting scipy
Using cached scipy-1.10.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
Collecting zarr
Downloading zarr-2.14.2-py3-none-any.whl (203 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 203.3/203.3 KB 14.5 MB/s eta 0:00:00
Collecting threadpoolctl
Using cached threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Requirement already satisfied: packaging in /opt/python/3.9.14/lib/python3.9/site-packages (from scarf) (23.1)
Requirement already satisfied: click>=8.0 in /opt/python/3.9.14/lib/python3.9/site-packages (from dask>=2021.8.1->scarf) (8.1.3)
Requirement already satisfied: pyyaml>=5.3.1 in /opt/python/3.9.14/lib/python3.9/site-packages (from dask>=2021.8.1->scarf) (6.0)
Collecting fsspec>=2021.09.0
Downloading fsspec-2023.4.0-py3-none-any.whl (153 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.0/154.0 KB 10.0 MB/s eta 0:00:00
Collecting partd>=1.2.0
Downloading partd-1.4.0-py3-none-any.whl (18 kB)
Collecting toolz>=0.10.0
Downloading toolz-0.12.0-py3-none-any.whl (55 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55.8/55.8 KB 3.4 MB/s eta 0:00:00
Collecting cloudpickle>=1.5.0
Downloading cloudpickle-2.2.1-py3-none-any.whl (25 kB)
Requirement already satisfied: zipp>=0.5 in /opt/python/3.9.14/lib/python3.9/site-packages (from importlib-metadata->scarf) (3.15.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/python/3.9.14/lib/python3.9/site-packages (from pandas->scarf) (2.8.2)
Collecting pytz>=2020.1
Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1
Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Collecting xarray
Downloading xarray-2023.4.2-py3-none-any.whl (979 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 979.5/979.5 KB 40.7 MB/s eta 0:00:00
Collecting colorcet
Downloading colorcet-3.0.1-py2.py3-none-any.whl (1.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 52.4 MB/s eta 0:00:00
Collecting pillow
Using cached Pillow-9.5.0-cp39-cp39-manylinux_2_28_x86_64.whl (3.4 MB)
Collecting pyct
Downloading pyct-0.5.0-py2.py3-none-any.whl (15 kB)
Collecting datashape
Downloading datashape-0.5.2.tar.gz (76 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.5/76.5 KB 4.5 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting param
Downloading param-1.13.0-py2.py3-none-any.whl (87 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.3/87.3 KB 5.2 MB/s eta 0:00:00
Collecting numpy
Using cached numpy-1.23.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Collecting llvmlite<0.40,>=0.39.0dev0
Using cached llvmlite-0.39.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.6 MB)
Collecting smart-open>=1.8.1
Downloading smart_open-6.3.0-py3-none-any.whl (56 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.8/56.8 KB 3.6 MB/s eta 0:00:00
Requirement already satisfied: ipython in /opt/python/3.9.14/lib/python3.9/site-packages (from ipython-autotime->scarf) (8.12.0)
Requirement already satisfied: traitlets>=4.3.1 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipywidgets->scarf) (5.9.0)
Requirement already satisfied: ipykernel>=4.5.1 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipywidgets->scarf) (6.22.0)
Requirement already satisfied: jupyterlab-widgets~=3.0.7 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipywidgets->scarf) (3.0.7)
Requirement already satisfied: widgetsnbextension~=4.0.7 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipywidgets->scarf) (4.0.7)
Collecting igraph<0.11,>=0.10.0
Downloading igraph-0.10.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3 MB 61.7 MB/s eta 0:00:00:00:01
Collecting fonttools>=4.22.0
Using cached fonttools-4.39.3-py3-none-any.whl (1.0 MB)
Collecting importlib-resources>=3.2.0
Using cached importlib_resources-5.12.0-py3-none-any.whl (36 kB)
Collecting contourpy>=1.0.1
Using cached contourpy-1.0.7-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (299 kB)
Collecting kiwisolver>=1.0.1
Using cached kiwisolver-1.4.4-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.6 MB)
Collecting pyparsing>=2.3.1
Using cached pyparsing-3.0.9-py3-none-any.whl (98 kB)
Collecting cycler>=0.10
Using cached cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collecting entrypoints
Downloading entrypoints-0.4-py3-none-any.whl (5.3 kB)
Requirement already satisfied: idna<4,>=2.5 in /opt/python/3.9.14/lib/python3.9/site-packages (from requests->scarf) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/python/3.9.14/lib/python3.9/site-packages (from requests->scarf) (1.26.15)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/python/3.9.14/lib/python3.9/site-packages (from requests->scarf) (3.1.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/python/3.9.14/lib/python3.9/site-packages (from requests->scarf) (2022.12.7)
Collecting patsy>=0.5.2
Using cached patsy-0.5.3-py2.py3-none-any.whl (233 kB)
Collecting pynndescent>=0.5
Using cached pynndescent-0.5.10.tar.gz (1.1 MB)
Preparing metadata (setup.py) ... done
Collecting fasteners
Downloading fasteners-0.18-py3-none-any.whl (18 kB)
Collecting asciitree
Downloading asciitree-0.3.3.tar.gz (4.0 kB)
Preparing metadata (setup.py) ... done
Collecting texttable>=1.6.2
Downloading texttable-1.6.7-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: debugpy>=1.6.5 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets->scarf) (1.6.7)
Requirement already satisfied: jupyter-core!=5.0.,>=4.12 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets->scarf) (5.3.0)
Requirement already satisfied: nest-asyncio in /opt/python/3.9.14/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets->scarf) (1.5.6)
Requirement already satisfied: jupyter-client>=6.1.12 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets->scarf) (8.2.0)
Requirement already satisfied: matplotlib-inline>=0.1 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets->scarf) (0.1.6)
Requirement already satisfied: pyzmq>=20 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets->scarf) (25.0.2)
Requirement already satisfied: tornado>=6.1 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets->scarf) (6.3)
Requirement already satisfied: comm>=0.1.1 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets->scarf) (0.1.3)
Requirement already satisfied: psutil in /opt/python/3.9.14/lib/python3.9/site-packages (from ipykernel>=4.5.1->ipywidgets->scarf) (5.9.5)
Requirement already satisfied: pickleshare in /opt/python/3.9.14/lib/python3.9/site-packages (from ipython->ipython-autotime->scarf) (0.7.5)
Requirement already satisfied: prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipython->ipython-autotime->scarf) (3.0.38)
Requirement already satisfied: stack-data in /opt/python/3.9.14/lib/python3.9/site-packages (from ipython->ipython-autotime->scarf) (0.6.2)
Requirement already satisfied: jedi>=0.16 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipython->ipython-autotime->scarf) (0.18.2)
Requirement already satisfied: decorator in /opt/python/3.9.14/lib/python3.9/site-packages (from ipython->ipython-autotime->scarf) (5.1.1)
Requirement already satisfied: backcall in /opt/python/3.9.14/lib/python3.9/site-packages (from ipython->ipython-autotime->scarf) (0.2.0)
Requirement already satisfied: pygments>=2.4.0 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipython->ipython-autotime->scarf) (2.15.1)
Requirement already satisfied: pexpect>4.3 in /opt/python/3.9.14/lib/python3.9/site-packages (from ipython->ipython-autotime->scarf) (4.8.0)
Requirement already satisfied: typing-extensions in /opt/python/3.9.14/lib/python3.9/site-packages (from ipython->ipython-autotime->scarf) (4.5.0)
Collecting locket
Downloading locket-1.0.0-py2.py3-none-any.whl (4.4 kB)
Requirement already satisfied: six in /opt/python/3.9.14/lib/python3.9/site-packages (from patsy>=0.5.2->statsmodels->scarf) (1.16.0)
Collecting multipledispatch>=0.4.7
Downloading multipledispatch-0.6.0-py3-none-any.whl (11 kB)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /opt/python/3.9.14/lib/python3.9/site-packages (from jedi>=0.16->ipython->ipython-autotime->scarf) (0.8.3)
Requirement already satisfied: platformdirs>=2.5 in /opt/python/3.9.14/lib/python3.9/site-packages (from jupyter-core!=5.0.,>=4.12->ipykernel>=4.5.1->ipywidgets->scarf) (3.2.0)
Requirement already satisfied: ptyprocess>=0.5 in /opt/python/3.9.14/lib/python3.9/site-packages (from pexpect>4.3->ipython->ipython-autotime->scarf) (0.7.0)
Requirement already satisfied: wcwidth in /opt/python/3.9.14/lib/python3.9/site-packages (from prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30->ipython->ipython-autotime->scarf) (0.2.6)
Requirement already satisfied: executing>=1.2.0 in /opt/python/3.9.14/lib/python3.9/site-packages (from stack-data->ipython->ipython-autotime->scarf) (1.2.0)
Requirement already satisfied: asttokens>=2.1.0 in /opt/python/3.9.14/lib/python3.9/site-packages (from stack-data->ipython->ipython-autotime->scarf) (2.2.1)
Requirement already satisfied: pure-eval in /opt/python/3.9.14/lib/python3.9/site-packages (from stack-data->ipython->ipython-autotime->scarf) (0.2.2)
Using legacy 'setup.py install' for umap-learn, since package 'wheel' is not installed.
Using legacy 'setup.py install' for pynndescent, since package 'wheel' is not installed.
Using legacy 'setup.py install' for asciitree, since package 'wheel' is not installed.
Using legacy 'setup.py install' for datashape, since package 'wheel' is not installed.
Building wheels for collected packages: hnswlib
Building wheel for hnswlib (pyproject.toml) ... done
Created wheel for hnswlib: filename=hnswlib-0.7.0-cp39-cp39-linux_x86_64.whl size=2157307 sha256=cbeeabf81c4880186a8985acb0844ce35ee916ddcaea5eec2846c68754f62c85
Stored in directory: /nfs/amishra/.cache/pip/wheels/ba/26/61/fface6c407f56418b3140cd7645917f20ba6b27d4e32b2bd20
Successfully built hnswlib
Installing collected packages: texttable, pytz, asciitree, tzdata, tqdm, toolz, threadpoolctl, smart-open, pyparsing, pybind11, pillow, param, numpy, networkx, multipledispatch, loguru, locket, llvmlite, kiwisolver, joblib, importlib-resources, igraph, fsspec, fonttools, fasteners, entrypoints, cycler, cloudpickle, scipy, pyct, patsy, partd, pandas, numcodecs, numba, leidenalg, hnswlib, h5py, datashape, contourpy, zarr, xarray, statsmodels, scikit-network, scikit-learn, matplotlib, kneed, gensim, dask, colorcet, seaborn, pynndescent, ipython-autotime, datashader, cmocean, umap-learn, scarf
ERROR: Could not install packages due to an OSError: [Errno 30] Read-only file system: '/opt/python/3.9.14/lib/python3.9/site-packages/texttable.py'

WARNING: You are using pip version 22.0.4; however, version 23.1 is available.
You should consider upgrading via the '/opt/python/3.9.14/bin/python3.9 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.

Docstrings, task list

List of docstrings that are currently missing:

writers.py
- write function in class ZarrMerge
- all dump methods
assay.py
- description in module level docstring
- {ADT,ATAC}assay
- the norm_* methods
- functions in {RNA,ADT}assay (e. g. normed and set_feature_stats)
meld_assay.py
- description in module level docstring
- function meld_assay
plots.py
- functions plot{_graph,}_qc, Are these for qc of a graph, or for plotting a qc plot?
readers.py
- H5adReader, description for attribute groupCodes
metadata.py
- several methods in class MetaData
ann.py
- module level docstring
- plus everything in it
dendrogram.py
- module level docstring
- CoalesceTree
- BalancedCut
plots.py
- most of the methods
markers.py
- find_markers_by_rank
umap.py
- no docstrings here.

loom file from kallisto/bustools import error:

Hi - I am using Scarf version '0.18.2' - the one installed in LS2 in the SingSingCell/1.1 singularity image.

I did this:

reader = scarf.readers.LoomReader( ifile, )
writer = scarf.CrToZarr(
    reader,
    zarr_fn='scarf_datasets/data/data.zarr',
    chunk_size=(2000, 1000)
)
writer.dump(batch_size=1000)

And I got this error:

AttributeError Traceback (most recent call last)
in
----> 1 writer = scarf.CrToZarr(
2 reader,
3 zarr_fn='scarf_datasets/HongzheAndPavan/data.zarr',
4 chunk_size=(2000, 1000)
5 )

/usr/local/lib/python3.8/dist-packages/scarf/writers.py in init(self, cr, zarr_fn, chunk_size, dtype)
184 self.z = zarr.open(self.fn, mode="w")
185 self._ini_cell_data()
--> 186 for assay_name in self.cr.assayFeats.columns:
187 create_zarr_count_assay(
188 self.z,

AttributeError: 'LoomReader' object has no attribute 'assayFeats'

After some research I assume that the class CrToZarr does not call the scarf.readers.LoomReader correctly - is that correct or is my loom file problematic? I'll look into that some more.

ZarrMerge producing empty counts while .zarr files used to merge have associated counts?

Thank you for producing this pipeline - it's quite helpful when dealing with huge sc datasets! I'm running into an issue when I try to merge two zarr objects.

I'm actually not throwing an error until I get to the mark_hvgs() step, but it's related to the ZarrMerge step not outputting cell counts by feature (i.e., the .zarr directory has no data files in test.zarr -> RNA -> counts).

Expected behavior

I would have expected that when I merge the two .zarr objects, that the DataStore object would not have "RNA assay has 0 (22032) features". All nCells are equal to 0. It's clear that the merge somehow did not summarize the counts. I played with the overwrite option and adjusting the update_key based on the vignette, but I can't seem to get the zarr files merged correctly. I am attempting to merge 115 .zarr dirs, but this seems to be problematic with only two .zarr.

The code I used is below (apologies if anything reported incorrectly, I haven't reported a ton of issues on git):

import re
import os
import scarf

a = scarf.DataStore(
         'S5464Nr1.zarr',
        nthreads = 4,
        min_features_per_cell = 1)
b = scarf.DataStore(
         'S5464Nr2.zarr',
        nthreads = 4,
        min_features_per_cell = 1)
scarf.ZarrMerge(
    zarr_path='test.zarr',
    assays=[a.RNA,b.RNA],
    names=['a','b'],
    merge_assay_name='RNA',
    overwrite=True
)
ds = scarf.DataStore(
    'test.zarr',
    nthreads = 4
)
ds

(RNA) Computing nCells and dropOuts: 100%|
1150/1150 [00:00]
(RNA) Computing nCounts: 100%|
1155/1155 [00:00]
WARNING: Minimum cell count (0) is lower than size factor multiplier (1000)
(RNA) Computing nFeatures: 100%|
1155/1155 [00:00]
(RNA) Computing RNA_percentMito: 100%|
1113/1113 [00:00]
WARNING: Percentage feature RNA_percentMito not added because not detected in any cell
(RNA) Computing RNA_percentRibo: 100%|
1113/1113 [00:00]
WARNING: Percentage feature RNA_percentRibo not added because not detected in any cell
WARNING: More than of half of the less have less than 10 features for assay: RNA. Will not remove low quality cells automatically.
DataStore has 20369 (20369) cells with 1 assays: RNA
   Cell metadata:
            'I', 'ids', 'names', 'RNA_nCounts', 'RNA_nFeatures', 
            'orig_RNA_nCounts', 'orig_RNA_nFeatures', 'orig_RNA_percentMito', 'orig_RNA_percentRibo'
   RNA assay has 0 (22032) features and following metadata:
            'I', 'ids', 'names', 'dropOuts', 'nCells',

ds.mark_hvgs(
    min_cells=20,
    top_n=2000,
    min_mean=-3,
    max_mean=2,
    max_var=6
)

/opt/anaconda3/lib/python3.9/site-packages/dask/array/slicing.py:647: RuntimeWarning: divide by zero encountered in long_scalars
  maxsize = math.ceil(nbytes / (other_numel * itemsize))

OverflowError                             Traceback (most recent call last)
/var/folders/b4/14kj85s94x52tlh_ptt0vf6r0000gn/T/ipykernel_63026/3363267796.py in <module>
----> 1 ds.mark_hvgs(
      2     min_cells=20,
      3     top_n=2000,
      4     min_mean=-3,
      5     max_mean=2,

/opt/anaconda3/lib/python3.9/site-packages/scarf/datastore.py in mark_hvgs(self, from_assay, cell_key, min_cells, top_n, min_var, max_var, min_mean, max_mean, n_bins, lowess_frac, blacklist, show_plot, hvg_key_name, **plot_kwargs)
   3604                 f"of cells will be considered HVGs."
   3605             )
-> 3606         assay.mark_hvgs(
   3607             cell_key,
   3608             min_cells,

/opt/anaconda3/lib/python3.9/site-packages/scarf/assay.py in mark_hvgs(self, cell_key, min_cells, top_n, min_var, max_var, min_mean, max_mean, n_bins, lowess_frac, blacklist, hvg_key_name, show_plot, **plot_kwargs)
    938             return f"{identifier}_{x}"
    939 
--> 940         self.set_feature_stats(cell_key, min_cells)
    941         identifier = self._load_stats_loc(cell_key)
    942         c_var_col = f"c_var__{n_bins}__{lowess_frac}"

/opt/anaconda3/lib/python3.9/site-packages/scarf/assay.py in set_feature_stats(self, cell_key, min_cells)
    830             return None
    831         n_cells = show_dask_progress(
--> 832             (self.normed(cell_idx, feat_idx) > 0).sum(axis=0),
    833             f"({self.name}) Computing nCells",
    834             self.nthreads,

/opt/anaconda3/lib/python3.9/site-packages/scarf/assay.py in normed(self, cell_idx, feat_idx, renormalize_subset, log_transform, **kwargs)
    792         if feat_idx is None:
    793             feat_idx = self.feats.active_index("I")
--> 794         counts = self.rawData[:, feat_idx][cell_idx, :]
    795         norm_method_cache = self.normMethod
    796         if log_transform:

/opt/anaconda3/lib/python3.9/site-packages/dask/array/core.py in __getitem__(self, index)
   1798 
   1799         out = "getitem-" + tokenize(self, index2)
-> 1800         dsk, chunks = slice_array(out, self.name, self.chunks, index2, self.itemsize)
   1801 
   1802         graph = HighLevelGraph.from_collections(out, dsk, dependencies=[self])

/opt/anaconda3/lib/python3.9/site-packages/dask/array/slicing.py in slice_array(out_name, in_name, blockdims, index, itemsize)
    172 
    173     # Pass down to next function
--> 174     dsk_out, bd_out = slice_with_newaxes(out_name, in_name, blockdims, index, itemsize)
    175 
    176     bd_out = tuple(map(tuple, bd_out))

/opt/anaconda3/lib/python3.9/site-packages/dask/array/slicing.py in slice_with_newaxes(out_name, in_name, blockdims, index, itemsize)
    194 
    195     # Pass down and do work
--> 196     dsk, blockdims2 = slice_wrap_lists(out_name, in_name, blockdims, index2, itemsize)
    197 
    198     if where_none:

/opt/anaconda3/lib/python3.9/site-packages/dask/array/slicing.py in slice_wrap_lists(out_name, in_name, blockdims, index, itemsize)
    260     if all(is_arraylike(i) or i == slice(None, None, None) for i in index):
    261         axis = where_list[0]
--> 262         blockdims2, dsk3 = take(
    263             out_name, in_name, blockdims, index[where_list[0]], itemsize, axis=axis
    264         )

/opt/anaconda3/lib/python3.9/site-packages/dask/array/slicing.py in take(outname, inname, chunks, index, itemsize, axis)
    645         warnsize = maxsize = math.inf
    646     else:
--> 647         maxsize = math.ceil(nbytes / (other_numel * itemsize))
    648         warnsize = maxsize * 5
    649 

OverflowError: cannot convert float infinity to integer

Scarf and Python version
Python: 3.9.7
Scarf: 0.19.6

How do you reset the column "I" ?

I loaded dataset as scarf.DataStore with the zarr directory then ran through the pipeline, i guess i made an error in the filter_cells function, now all the feats in column I are labelled False.

Therefore, when running mark_hvgs i keep getting an error:

ds.mark_hvgs(
min_cells=20,
top_n=500,
min_mean=3,
max_mean=2,
max_var=6
)

Error:
WARNING: WARNING: Number of valid features are less then value of parameter top_n: 500. Resetting top_n to 0

IndexError Traceback (most recent call last)
/tmp/ipykernel_32217/1850153007.py in
----> 1 ds.mark_hvgs(
2 min_cells=20,
3 top_n=500,
4 min_mean=3,
5 max_mean=2,

~/anaconda3/envs/scarf_env/lib/python3.8/site-packages/scarf/datastore.py in mark_hvgs(self, from_assay, cell_key, min_cells, top_n, min_var, max_var, min_mean, max_mean, n_bins, lowess_frac, blacklist, show_plot, hvg_key_name, **plot_kwargs)
3565 f"of cells will be considered HVGs."
3566 )
-> 3567 assay.mark_hvgs(
3568 cell_key,
3569 min_cells,

~/anaconda3/envs/scarf_env/lib/python3.8/site-packages/scarf/assay.py in mark_hvgs(self, cell_key, min_cells, top_n, min_var, max_var, min_mean, max_mean, n_bins, lowess_frac, blacklist, hvg_key_name, show_plot, **plot_kwargs)
932 top_n = n_valid_feats - 1
933 min_var = (
--> 934 pd.Series(self.feats.fetch_all(col_renamer(c_var_col)))[idx]
935 .sort_values(ascending=False)
936 .values[top_n]

IndexError: index -1 is out of bounds for axis 0 with size 0

Could you advise on how to reset the column I or do i have to restart the piepline from re-creating the zarr files?

Thank you
best regards
Monib

Missing figure legends in PNGs

Hi,

When exporting UMAPs as PNGs with ds.plot_layout() the figure legend is cut off. Possible to fix?

Example code:
ds.plot_layout(layout_key = "RNA_UMAP", color_by = "maxHTO", cmap= "Set3", savename=f"{sample}_UMAP.png")

best,

Rebecca

run_clustering() error using Paris clustering algorithm

Hi,
I am using Scarf version 0.19.6 on Python 3.9.1 and was following the workflow for scRNA-seq on https://scarf.readthedocs.io/en/latest/vignettes/basic_tutorial_scRNAseq.html , I am getting this error while running Paris clustering algorithm using run_clustering method.

ZeroDivisionError                         Traceback (most recent call last)
Input In [77], in <cell line: 1>()
----> 1 ds.run_clustering(n_clusters=leiden_clusters.nunique()[0])

File ~/miniconda3/envs/scarf_env/lib/python3.9/site-packages/scarf/datastore.py:2013, in GraphDataStore.run_clustering(self, from_assay, cell_key, feat_key, n_clusters, integrated_graph, symmetric_graph, graph_upper_only, balanced_cut, max_size, min_size, max_distance_fc, force_recalc, label)
   2004 paris = skn.hierarchy.Paris(reorder=False)
   2005 graph = self.load_graph(
   2006     from_assay=from_assay,
   2007     cell_key=cell_key,
   (...)
   2011     graph_loc=graph_loc,
   2012 )
-> 2013 dendrogram = paris.fit_transform(graph)
   2014 dendrogram[dendrogram == np.Inf] = 0
   2015 g = create_zarr_dataset(
   2016     self.z[graph_loc],
   2017     dendrogram_loc.rsplit("/", 1)[1],
   (...)
   2020     (graph.shape[0] - 1, 4),
   2021 )

File ~/miniconda3/envs/scarf_env/lib/python3.9/site-packages/sknetwork/hierarchy/base.py:40, in BaseHierarchy.fit_transform(self, *args, **kwargs)
     32 def fit_transform(self, *args, **kwargs) -> np.ndarray:
     33     """Fit algorithm to data and return the dendrogram. Same parameters as the ``fit`` method.
     34 
     35     Returns
   (...)
     38         Dendrogram.
     39     """
---> 40     self.fit(*args, **kwargs)
     41     return self.dendrogram_

File sknetwork/hierarchy/paris.pyx:290, in sknetwork.hierarchy.paris.Paris.fit()

ZeroDivisionError: float division

To Reproduce

ds.run_clustering(n_clusters=leiden_clusters.nunique()[0])

Link to vignettes in docs pages

Currently the linking to vignettes does not work in the docs pages (/docs/source/vignettes.rst). What we did in Nabo was to keep the vignettes directory in the docs dir. In contrast it is a separate dir here (/vignettes). I've been looking into linking to external files (e. g. https://stackoverflow.com/a/17217041/6301083) but it seems sphinx does not like that and it's not trivial to get it working. Am I missing something, how it's supposed to work, and it is working on your end @parashardhapola ? Or should I take the lead on a solution for this?

Rename ds.cells.columns?

Is there a function to rename ds.cells.columns?
Tried ds.cells.rename(columns = {A : B}), couldn't find info in tutorial or API.

High memory consumption by Dask arrays

Using versions of Dask >2021.03.1 might cause high memory consumption when using Scarf.
The temporary solution is to install dask==2021.03.1. This version has been pinned in the requirements.txt file.

The issue is being traced here on Dask repo: dask/dask#7583

clusters input for TopACeDo

Hi,
I want to use TopACeDo to downsample the dataset. It seems that pre-defined clusters are required for TopACeDo.
I am wondering how big clustering methods or resolution value can affect the downsampling cells results. Thanks.

adjusting filtering parameters in ds.filter_cells()

Hi,

How do I adjust filtering parameters during cell filtering? If I try to change the first given values for "lows" and "highs" the plot axes wont change when re-plotting the cell distributions. I can reset the values if i convert the hf5 file to a datastore object again, but this is quite tedious. Is there a another way around this?

reader = scarf.CrH5Reader(S1, file_type='rna')
writer = scarf.CrToZarr(reader, "S1.zarr", chunk_size=(2000, 5000))
writer.dump()

ds = scarf.DataStore("S1.zarr", nthreads=6, default_assay='RNA')
ds.plot_cells_dists()
ds.filter_cells(attrs=['RNA_nFeatures', 'RNA_percentMito'], lows=[1000, -1], highs=[7000, 15])
ds.plot_cells_dists(cell_key='I', color='coral')

Error: Conversion Anndata to Scarf’s Zarr format file

Discussed in #76

^{Originally posted by drewmard September 6, 2022}
Hi! I saw the recent paper in Nat Comm and the scarf package looks great! I have a large scRNA-seq dataset and I am interested in switching some of our pipelines over from Seurat/Scanpy to Scarf. However, I am running into an error when converting anndata to zarr file formats.

I am using an example anndata file:

>>> adata
AnnData object with n_obs × n_vars = 12479 × 33538
    obs: 'nCount_RNA', 'nFeature_RNA', 'percent.mt'
    var: 'name'

The code I am running is:

reader = scarf.H5adReader(
    f_in, 
    cell_ids_key = 'index',               # Where Cell/barcode ids are saved under 'obs' slot
    feature_ids_key = 'name',            # Where gene ids are saved under 'var' slot
    feature_name_key = 'name'  # Where gene names are saved under 'var' slot
)  

# change value of `zarr_fn` to your choice of filename and path
writer = scarf.H5adToZarr(
    reader,
    zarr_fn=f_out
)
writer.dump()

And the final error is for writer.dump():

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/amarder/anaconda3/envs/scarf_env/lib/python3.9/site-packages/scarf/writers.py", line 394, in dump
    store.set_coordinate_selection((a.row + s, a.col), a.data)
  File "/home/amarder/anaconda3/envs/scarf_env/lib/python3.9/site-packages/zarr/core.py", line 1608, in set_coordinate_selection
    indexer = CoordinateIndexer(selection, self)
  File "/home/amarder/anaconda3/envs/scarf_env/lib/python3.9/site-packages/zarr/indexing.py", line 714, in __init__
    boundscheck_indices(dim_sel, dim_len)
  File "/home/amarder/anaconda3/envs/scarf_env/lib/python3.9/site-packages/zarr/indexing.py", line 454, in boundscheck_indices
    raise BoundsCheckError(dim_len)
zarr.errors.BoundsCheckError: index out of bounds for dimension with length 12479

Here is the full output:

>>> reader = scarf.H5adReader(
...     f_in,
...     cell_ids_key = 'index',               # Where Cell/barcode ids are saved under 'obs' slot
...     feature_ids_key = 'name',            # Where gene ids are saved under 'var' slot
...     feature_name_key = 'name'  # Where gene names are saved under 'var' slot
... )
WARNING: `obsm` group not found in the H5ad file
>>> 
>>> # change value of `zarr_fn` to your choice of filename and path
>>> writer = scarf.H5adToZarr(
...     reader,
...     zarr_fn=f_out
... )
INFO: No value provided for assay names. Will use default value: 'RNA'
WARNING: Could not find cells ids key: index in `obs`.
Reading attributes from group obs:   0%|                                                                                         
                                                                                                                                 
                                                                                                                        Reading a
ttributes from group obs:  25%| █████████████████████████████████████████████████████████████████████████████████████████████████
██████████████▊                                                                                                                  
                                                                                                               Reading attributes
 from group obs:  75%| ██████████████████████████████████████████████████████████████████████████████████████████████████████████
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
████████████████████████████████████████████████████████████████████████████████████████████████████▎ Reading attributes from gro
up obs: 100%| ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
██████████████████████████████████████████████████████████████████████████ 4/4 [00:00]
WARNING: Reading of obsm failed because it either does not exist or is not in expected format
Reading attributes from group var:   0%|                                                                                         
                                                                                                                                 
                                                                                                                        Reading attributes from group var:  50%| ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                                                                                 Reading attributes from group var: 100%| ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 2/2 [00:00]
	
>>> writer.dump()
  0%|                                                                                                                                                                                                                                                                                                                                                                                       8%| █████████████████████████████████████                                                                                                                                                                                                                                                                                                                                                15%| ██████████████████████████████████████████████████████████████████████████                                                                                                                                                                                                                                                                                                           23%| ███████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                                                                                                                                                                      31%| ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                                                                                                                                 38%| █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                                                                                            46%| ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                                                       54%| ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                  62%| ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                             69%| █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                        77%| █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   85%| ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 92%| ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 92%| ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                      12/13 [00:09]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/amarder/anaconda3/envs/scarf_env/lib/python3.9/site-packages/scarf/writers.py", line 394, in dump
    store.set_coordinate_selection((a.row + s, a.col), a.data)
  File "/home/amarder/anaconda3/envs/scarf_env/lib/python3.9/site-packages/zarr/core.py", line 1608, in set_coordinate_selection
    indexer = CoordinateIndexer(selection, self)
  File "/home/amarder/anaconda3/envs/scarf_env/lib/python3.9/site-packages/zarr/indexing.py", line 714, in __init__
    boundscheck_indices(dim_sel, dim_len)
  File "/home/amarder/anaconda3/envs/scarf_env/lib/python3.9/site-packages/zarr/indexing.py", line 454, in boundscheck_indices
    raise BoundsCheckError(dim_len)
zarr.errors.BoundsCheckError: index out of bounds for dimension with length 12479

Any suggestions would be appreciated. Thanks in advance!
-Andrew Marderstein

Drop column not dropping

I'm trying to remove a column in cells. Used ds.cells.drop('column_name') and nothing happened.

Loom file reader?

A loom file reader would be nice to directly read from kallisto out files.

export cluster markers

Hi,
I'm trying to export cluster markers to a csv file. And thus I tried: "ds.export_markers_to_csv(from_assay='RNA', group_key= 'RNA_leiden_cluster', csv_filename= './tables/cluster_markers/Cluster_markers.csv')".
And got the error message "AttributeError: module 'scarf' has no attribute 'export_markers_to_csv'".

cannot make UMAP

Package name on PyPi

With release of version 0.9.0 scarf was recently renamed to scarf from scarf-toolkit on PyPi. So if you previously installed using pip install scarf-toolkit then you need to first uninstall using pip uninstall scarf-toolkit. Thereafter you can install the recent version of Scarf using pip install -U scarf

ds.to_anndata() and scarf.writers.to_h5ad() crteate two different data structures

When I use the to_anndata() function in scarf 0.20.0 I get an anndata object with all requred information ([20] in the script); If I use the to_h5ad I get only a minimal anndata object lacking all UMP/TSNE information [25].
In addition it would be cool dimension reduction data could be stored in an obsm table instead of an obs table mimicking the scanpy behavior.

A detailed description of what I did is in the attached html script file.

concat_1_EX05.pdf

Constructing a scarf Zarr dataset from scratch

Hello there. I've been thinking a lot lately about Dask and scRNAseq analysis so I'm excited to give this a try! We have a ton of data to analyze and most tools aren't able to scale well enough, so having something that can distribute the work is great.

The first step I'm figuring out is how to get the data into the right format. I already have a very large dataset (10X snRNAseq) in Zarr format, stored as a cell-by-gene matrix of UMI counts, so I would prefer not to go through the CR import process but just tweak the layout to be compatible with Scarf. Is that reasonable?

I think I can figure this out from the docs but I thought I'd open an issue in case there's an easy answer. I am guessing it's just a matter of naming things correctly.

How to add feature name changed by gene name?

The names are not correct. How can I replace it by gene_name?

Normalization of ADTs on merged datasets

In a merged data set of sample A and B.
If I color the umap based on the normed ADT expression as written below..

ds.plot_layout(layout_key='RNA_UMAP', from_assay='ADT', color_by= f"ADT_X", cmap='coolwarm', clip_fraction=0.05, cell_key = "I")

...the full sample (cell_key ="I") looks ok.
But if I set cell_key = sample A or sample B separately the expression is lost for sample A.
Could something be off with the ADT normalization when you split the data set with cell_key?

Thanks!

How can I generate the UMAP like this (cluster centroids). Can I also change the format and position of this figure?

ATAC-seq recognized as RNA-seq?

Hi staffs of scarf, I downloaded Cusanovich's 81K sc-ATAC-seq dataset using your fetch_dataset(),
but when I was importing the dataset, this happened:

reader = scarf.CrDirReader(temp_path + '/cusanovich_81K_mouse_atacseq',mtx_separator='/t')
reader.assayFeats

RNA
type	Gene Expression
start	0
end	167013
nFeatures	167013

somehow it was auto-recognized as a RNA-seq assay, since file_type is deprecated I cannot either change the assay type or go on my downstream analysis.
Any possible solutions?

parashardhapola / scarf Goto Github PK

scarf's People

Contributors

Stargazers

Watchers

Forkers

scarf's Issues

New functionality

Improved support

Visualization

Support for formats and dataset management

And I got this error:

Discussed in #76

Recommend Projects

Recommend Topics

Recommend Org