scverse / anndata
Annotated data.
Home Page: http://anndata.readthedocs.io
License: BSD 3-Clause "New" or "Revised" License
Is there a plan to add a subsetting method for extracting particular cells from an AnnData object?
I've found that I can select multiple rows and then iterate over their columns and values like this:
# Trying to find how to get one gene
selected = adata[:, adata.var_names.isin(['Tcea1', 'Xkr4'])] # works
# this version only works if I'm using a sparse matrix, if not the tocoo() call fails.
#cx = adata.X.tocoo()
#for cell, gene, value in zip(adata.obs_names[cx.row], adata.var_names[cx.col], cx.data):
# print(cell, gene, value)
# This is for a complete matrix
for g, gene in enumerate(selected.var_names):
    for c, cell in enumerate(selected.obs_names):
        print("{0}\t{1}\t{2}".format(cell, gene, selected.X[c, g]))
But now I want to select just ONE row and iterate over its columns and values.
These do not work.
selected = adata[:, adata.var_names.isin(['Tcea1',])] # this doesn't
selected = adata[:, adata.var_names['Tcea1']] # this doesn't
What actually fails each time, though, is not the selection itself but the attempt to print it:
Traceback (most recent call last):
File "./h5ad_test_read.py", line 35, in
print("{0}\t{1}\t{2}".format(cell, gene, selected.X[c, g]))
IndexError: too many indices for array
I'd love to see an example of doing this correctly for a single gene.
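The IndexError above suggests that X comes back one-dimensional when only one gene is selected, so `selected.X[c, g]` carries one index too many. Here is a minimal sketch of a defensive loop, using plain NumPy arrays as hypothetical stand-ins for `selected.X`, `selected.obs_names` and `selected.var_names`; it is not the library's own fix.

```python
import numpy as np

# Hypothetical stand-ins for selected.obs_names, selected.var_names and selected.X.
obs_names = ["cell_a", "cell_b", "cell_c"]
var_names = ["Tcea1"]
x = np.array([0.0, 2.0, 5.0])  # 1-D, as returned for a single selected gene

# Force a (n_obs, n_vars) layout so that x2d[c, g] is always valid.
x2d = np.asarray(x).reshape(len(obs_names), -1)

rows = []
for g, gene in enumerate(var_names):
    for c, cell in enumerate(obs_names):
        rows.append((cell, gene, float(x2d[c, g])))
        print("{0}\t{1}\t{2}".format(cell, gene, x2d[c, g]))
```

The reshape is a no-op when X is already two-dimensional, so the same loop should also cover the multi-gene case.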
Trying to index into an AnnData object which has integer obs_names throws an assertion error. I expected either that constructing an object with integer obs_names would be disallowed, or that indexing into it would work.
Here's a quick example. First I instantiate an AnnData object, give it integers for observation names, then get an error when I try to index into it:
>>> import scanpy.api as sc
>>> import pandas as pd
>>> import numpy as np
>>> adata = sc.datasets.krumsiek11()
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
... storing 'cell_type' as categorical
>>> adata.obs_names = np.array(range(adata.n_obs))
>>> adata[:, ['Gata2', 'Gata1']]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1211, in __getitem__
return self._getitem_view(index)
File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1214, in _getitem_view
oidx, vidx = self._normalize_indices(index)
File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1190, in _normalize_indices
obs = _normalize_index(obs, self.obs_names)
File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 231, in _normalize_index
'Don’t call _normalize_index with non-categorical/string names'
AssertionError: Don’t call _normalize_index with non-categorical/string names
This error in indexing can be recovered by using a pandas.RangeIndex for the observation names:
>>> adata.obs_names = pd.RangeIndex(stop=adata.n_obs)
>>> adata[:, ['Gata2', 'Gata1']]
View of AnnData object with n_obs × n_vars = 640 × 2
obs: 'cell_type'
uns: 'iroot', 'highlights'
However, range indexes are frequently implicitly replaced with integer indexes:
>>> adata_norm = sc.pp.normalize_per_cell(adata, copy=True)
>>> adata_norm.obs_names
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
630, 631, 632, 633, 634, 635, 636, 637, 638, 639],
dtype='int64', length=640)
>>> adata_norm[:, ['Gata2', 'Gata1']]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1211, in __getitem__
return self._getitem_view(index)
File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1214, in _getitem_view
oidx, vidx = self._normalize_indices(index)
File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 1190, in _normalize_indices
obs = _normalize_index(obs, self.obs_names)
File "/usr/local/lib/python3.6/site-packages/anndata/base.py", line 231, in _normalize_index
'Don’t call _normalize_index with non-categorical/string names'
AssertionError: Don’t call _normalize_index with non-categorical/string names
Thanks!
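Since the assertion complains about non-string names, one possible workaround (an assumption, not something confirmed in this thread) is to cast the labels to strings before assigning them. The sketch below only shows the pandas side; the analogous call on an AnnData object would presumably be `adata.obs_names = adata.obs_names.astype(str)`.

```python
import pandas as pd

# Integer labels, as produced by np.array(range(n_obs)) in the report above.
int_names = pd.Index(range(5))

# Casting to str yields labels that string-based lookups can handle.
str_names = int_names.astype(str)
print(list(str_names))
```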
@Koncopd Let's discuss the issue here.
In essence, we want to have loom's layers functionality also for AnnData in order to deal with replicated data matrices as produced by the velocyto command line tool.
The most basic thing we need is the iteration over the layers in the loom file and their corresponding initialization in the AnnData file, which would be an extension of read_loom (https://github.com/theislab/anndata/blob/86ede1effa86a5d88db18d71c21c4057887066b5/anndata/readwrite/read.py#L126-L155)
adata.layer[key] = loomconnection.layer[key][:, :].T
The main questions are: how do we elegantly combine X and the layers group? Shall we call it .layers_X for more verbosity and to stress the fact that we force the dimensions to be the same? How do we deal with the transposition? Ideally, when in backed mode, we don't want to load everything in the loom file into memory but rather convert the file into an .h5ad file.
Let's start with memory mode and some simple functionality, though...
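For the in-memory case, the loop could look roughly like the sketch below. It uses plain dicts of NumPy arrays as hypothetical stand-ins for the loom layers and for the proposed layers container, so none of the names here are loompy or AnnData API.

```python
import numpy as np

n_genes, n_cells = 4, 3

# Hypothetical stand-in for the layers of a loom file (genes x cells).
loom_layers = {
    "spliced": np.ones((n_genes, n_cells)),
    "unspliced": np.zeros((n_genes, n_cells)),
}

X = np.ones((n_cells, n_genes))  # AnnData orientation: cells x genes

layers = {}
for key, matrix in loom_layers.items():
    transposed = matrix.T
    # Enforce that every layer has the same dimensions as X.
    if transposed.shape != X.shape:
        raise ValueError(
            f"layer {key!r} has shape {transposed.shape}, expected {X.shape}"
        )
    layers[key] = transposed
```

Checking the shape at assignment time is one way to make the "same dimensions as X" constraint explicit rather than implied by the name.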
Clean up https://anndata.readthedocs.io/en/latest/benchmarks.html: move https://github.com/Koncopd/anndata-scanpy-benchmarks to anndata_usage and provide links to nbviewer similar to how it’s done on https://scanpy.readthedocs.io/en/latest/examples.html.
Fixed by retaining 2D storage via #55
Hi @Koncopd,
would you write proper documentation for the .layers attribute you built? Currently, it's only a very uninformative stub:
https://anndata.readthedocs.io/en/latest/anndata.AnnData.layers.html
It should contain a reference to loompy, scvelo and very basic examples.
Here, even the heading in the table is missing:
https://anndata.readthedocs.io/en/latest/anndata.AnnData.html
Hey,
adata = sc.datasets.paul15()
adata._get_obs_array('Sfpi1', use_raw=True)
raises ValueError: Did not find Sfpi1 in obs.keys or var_names.
However the correct exception should be ".raw doesn't exist" or so.
How I ended up with this was actually sc.pl.scatter(adata, 'Sfpi1', 'Gata1'), which raises the same exception. Raising an exception explicitly about the lack of .raw would be much more informative for users.
I guess @Koncopd added the layer support, so he might be interested.
Hi, I recently stumbled upon the following problem:
In[98]: adata
Out[98]:
AnnData object with n_obs × n_vars = 20728 × 32738
obs: 0, 'batch', 'condition', 'source'
var: 0, 1
In[99]: adata.write(file)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-99-c21b8d69b5d6>", line 1, in <module>
adata.write(file)
File "/usr/lib/python3.6/site-packages/anndata/base.py", line 1779, in write
_write_h5ad(filename, self, compression=compression, compression_opts=compression_opts)
File "/usr/lib/python3.6/site-packages/anndata/readwrite/write.py", line 94, in _write_h5ad
d = adata._to_dict_fixed_width_arrays()
File "/usr/lib/python3.6/site-packages/anndata/base.py", line 1926, in _to_dict_fixed_width_arrays
obs_rec, uns_obs = df_to_records_fixed_width(self._obs)
File "/usr/lib/python3.6/site-packages/anndata/base.py", line 176, in df_to_records_fixed_width
uns[k + '_categories'] = c.cat.categories.values
TypeError: unsupported operand type(s) for +: 'int' and 'str'
As it seems, df_to_records_fixed_width has problems when some column names are actually integers.
The following solves this problem:
adata.var.columns = adata.var.columns.astype(str)
adata.obs.columns = adata.obs.columns.astype(str)
Hi,
I observed the following issue:
After applying the concatenate function, I ended up with more features, even though I definitely had the same number of genes in both parts.
cc @mbuttner
Here is the example:
import scanpy.api as sc
sc.settings.verbosity = 0
adata = sc.datasets.blobs()
sc.pp.neighbors(adata)
sc.tl.louvain(adata)
sc.tl.rank_genes_groups(adata, 'louvain')
adata.rename_categories('louvain', {'1': 'a'})
throws the error
ValueError Traceback (most recent call last)
<ipython-input-29-5a4b44b293eb> in <module>()
6 sc.tl.rank_genes_groups(adata, 'louvain')
7
----> 8 adata.rename_categories('louvain', {'1': 'a'})
~/miniconda3/envs/spols180816d/lib/python3.6/site-packages/anndata/base.py in rename_categories(self, key, categories)
1359 if isinstance(v2, np.ndarray) and v2.dtype.names is not None:
1360 if list(v2.dtype.names) == old_categories:
-> 1361 self.uns[k1][k2].dtype.names = categories
1362 else:
1363 logg.warn(
ValueError: must replace all names at once with a sequence of length 5
The problem does not appear if I skip sc.tl.rank_genes_groups(adata, 'louvain'). I am on the master branch of scanpy. Recently, sc.tl.rank_genes_groups started reporting p-values and so on. Could that be related to the error?
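The message says that dtype.names only accepts a full-length sequence, while the call above passes a partial mapping. As a sketch of one way around that, NumPy's `numpy.lib.recfunctions.rename_fields` renames only the fields named in a mapping. The record array below is a hypothetical stand-in for the per-group arrays stored in adata.uns, and this is not what rename_categories itself does.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical stand-in for a per-group record array such as those stored in adata.uns.
scores = np.zeros(2, dtype=[("0", "f4"), ("1", "f4"), ("2", "f4")])

# Rename only the mapped field; unmapped field names are kept.
renamed = rfn.rename_fields(scores, {"1": "a"})
print(renamed.dtype.names)
```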
read_csv delegates to read_txt, which doesn’t exist: NameError: name 'read_txt' is not defined
I’m on it
I believe pathlib needs to be added as a dependency.
> pip install anndata
Collecting anndata
Using cached https://files.pythonhosted.org/packages/a2/31/abf1918b45012977f1f78de6cdd01ee6c3650acae538ff8f7b0b17c1f47f/anndata-0.6.5.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "c:\users\scott\appdata\local\temp\pip-build-lzqa4a\anndata\setup.py", line 2, in <module>
from pathlib import Path
ImportError: No module named pathlib
This is with Python 2.7 on Windows.
Hi,
thanks for putting this out there, Python needed an annotated data format :)
I am trying to use the diffusion maps from scanpy, and it requires me to format my input array in AnnData format. Now, according to the documentation, I should be able to call AnnData with a numpy.ndarray and without any annotation. However, when I do that, even with ad = AnnData(np.ones((2, 2))), like in the documentation, I get a TypeError:
TypeError Traceback (most recent call last)
<ipython-input-8-1792540ebafc> in <module>()
----> 1 ad = AnnData(X1)
~/miniconda2/envs/py35/lib/python3.5/site-packages/anndata/anndata.py in __init__(self, data, smp, var, uns, smpm, varm, dtype, single_col)
334
335 # multi-dimensional array annotations
--> 336 if smpm is None: smpm = np.empty(self._n_smps, dtype=[])
337 if varm is None: varm = np.empty(self._n_vars, dtype=[])
338 self._smpm = BoundRecArr(smpm, self, 'smpm')
TypeError: Empty data-type
The error persists in Jupyter notebooks and in the Python console. Any ideas what may be causing that?
cheers,
Niko
AnnData's __getitem__ seems to exhibit a cross-product behaviour (similar to Python slices or np.ix_()) instead of numpy fancy indexing behaviour (so when it's indexed with 3 rows and 3 columns, it returns a 3x3 AnnData, not 3 scalars as in fancy indexing).
This is very useful and intuitive, in my opinion, because when users specify cell and gene indices they mean cell and gene filtering:
adata = sc.datasets.paul15_raw()
print(adata[[0, 4, 10], ['Sfpi1']])
print(adata[[0, 4, 10], ['Sfpi1', 'Gata1', 'Fli1']])
However, this is the case only if the col/row index dimensions are compatible in terms of broadcasting:
print(adata[[0, 4, 10], ['Sfpi1', 'Gata1']])
This is a bit difficult to understand :) Either all 3 cases should raise an exception or they should all perform cross-product-like slicing. What do you think?
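For reference, the two NumPy behaviours being compared can be seen on a plain array. This sketch uses only NumPy, with an arbitrary 12 × 4 array, and says nothing about how AnnData implements its indexing.

```python
import numpy as np

a = np.arange(12 * 4).reshape(12, 4)
rows = [0, 4, 10]

# Cross-product selection: every listed row combined with every listed column.
block = a[np.ix_(rows, [1, 2])]
print(block.shape)

# Fancy indexing pairs the indices element-wise after broadcasting.
paired = a[rows, [1, 2, 3]]   # three scalars: a[0, 1], a[4, 2], a[10, 3]
single = a[rows, [1]]         # [1] broadcasts against the three rows

try:
    a[rows, [1, 2]]           # lengths 3 and 2 cannot be broadcast
    mismatch_raises = False
except IndexError:
    mismatch_raises = True
```

With `np.ix_` the 3 × 2 case is well defined, which is the consistent cross-product behaviour the post argues for.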
In 4233e07 we started calling logging.basicConfig and in 40bdcbb we changed the global log level to INFO. This means that everyone’s logging configuration is overridden (not nice) and they start seeing INFO-level noise from all modules!
Python modules should never call any code with global side effects on import. logging.basicConfig goes into __main__.
@falexwolf said that the goal was to have similar output to scanpy’s. The problem is that scanpy uses its own logging infrastructure instead of python’s. My proposal:
We can still use our own logging format and set scanpy and anndata to INFO level like this:
import logging
import sys

logger = logging.getLogger(__name__)
logger.propagate = False # Don’t pass log messages on to the root logger and its handler
logger.setLevel('INFO')
handler = logging.StreamHandler(sys.stderr) # Why did we use stdout?
handler.setFormatter(logging.Formatter('%(message)s'))
handler.setLevel('INFO')
logger.addHandler(handler)
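To check that this setup really leaves the application's root logger alone, here is a small self-contained sketch. The logger name `mylib_sketch` and the in-memory streams are made up for the demonstration; they replace stderr only so the result can be inspected.

```python
import io
import logging

# Root logger configured by the application; a library should leave it alone.
root_stream = io.StringIO()
root_handler = logging.StreamHandler(root_stream)
logging.getLogger().addHandler(root_handler)

# Library-side logger with its own handler and no propagation to the root.
lib_stream = io.StringIO()
lib_logger = logging.getLogger("mylib_sketch")  # hypothetical package name
lib_logger.propagate = False
lib_logger.setLevel(logging.INFO)
lib_handler = logging.StreamHandler(lib_stream)
lib_handler.setFormatter(logging.Formatter("%(message)s"))
lib_logger.addHandler(lib_handler)

lib_logger.info("reading file")
```

The message ends up only in the library's own stream, and the root handler sees nothing.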
Hey guys,
What's the easiest way to remove a field in adata.obsm?
For example, I have
AnnData object with n_obs × n_vars = 48011 × 25583
obs: 'n_genes', 'n_counts', 'percent_mito', 'Sample', 'Donor', 'Tissue', 'batch', 'DCA_split', 'size_factors'
var: 'gene_ids', 'n_counts'
uns: 'DCA_losses'
obsm: 'X_dca', 'X_dca_mean', 'X_dca_hidden', 'X_dca_dropout', 'X_dca_dispersion'
and I just want to remove the items in obsm to reduce memory consumption.
Hi,
I converted a Seurat object into a .loom file and tried to read it into Scanpy using the read_loom() function. I got the following error:
data = sc.read_loom(filename)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-c4e72203067d> in <module>()
----> 1 data = sc.read_loom(filename)
/home/yueqi/anaconda/envs/py36/lib/python3.6/site-packages/anndata/readwrite/read.py in read_loom(filename)
149 X.T,
150 obs=lc.col_attrs,
--> 151 var=lc.row_attrs)
152 lc.close()
153 return adata
/home/yueqi/anaconda/envs/py36/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, raw, dtype, single_col, filename, filemode, asview, oidx, vidx)
753 obsm=obsm, varm=varm, raw=raw,
754 dtype=dtype, single_col=single_col,
--> 755 filename=filename, filemode=filemode)
756
757 def _init_as_view(self, adata_ref, oidx, vidx):
/home/yueqi/anaconda/envs/py36/lib/python3.6/site-packages/anndata/base.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, raw, dtype, single_col, filename, filemode)
889 # annotations
890 self._obs = _gen_dataframe(obs, self._n_obs,
--> 891 ['obs_names', 'row_names', 'smp_names'])
892 self._var = _gen_dataframe(var, self._n_vars, ['var_names', 'col_names'])
893
/home/yueqi/anaconda/envs/py36/lib/python3.6/site-packages/anndata/base.py in _gen_dataframe(anno, length, index_names)
228 break
229 else:
--> 230 _anno = pd.DataFrame(anno)
231 return _anno
232
/home/yueqi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
402 dtype=values.dtype, copy=False)
403 else:
--> 404 raise ValueError('DataFrame constructor not properly called!')
405
406 NDFrame.__init__(self, mgr, fastpath=True)
ValueError: DataFrame constructor not properly called!
I resolved this error by changing _anno = pd.DataFrame(anno) to _anno = pd.DataFrame(dict(anno)) in base.py.
This is because the loompy package extracts column and row annotations as generators rather than dictionaries, and pd.DataFrame does not take generators as input.
Hope it's fixed in the future. Thanks!
Yueqi
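The `dict(anno)` fix implies that the loompy attributes iterate as (name, values) pairs. Under that assumption, the conversion can be sketched with a hypothetical generator in place of `lc.col_attrs`:

```python
import numpy as np
import pandas as pd

def attr_pairs():
    # Hypothetical stand-in for an attribute container that yields (name, values) pairs.
    yield "CellID", np.array(["c1", "c2", "c3"])
    yield "Cluster", np.array([0, 1, 0])

# Materialise the pairs into a dict so each name becomes one column.
obs = pd.DataFrame(dict(attr_pairs()))
print(obs)
```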
Hi, I'd like to read a "barcodes.tsv" file:
sc.read_csv("barcodes.tsv", delimiter="\t")
However, this file has only one column and cannot be loaded:
Traceback (most recent call last):
File "<ipython-input-24-7055edd4368a>", line 13, in load_mtx_to_adata
ad.obs = sc.read_csv("barcodes.tsv", delimiter="\t")
File "/usr/lib/python3.6/site-packages/anndata/readwrite/read.py", line 36, in read_csv
return read_text(filename, delimiter, first_column_names, dtype)
File "/usr/lib/python3.6/site-packages/anndata/readwrite/read.py", line 210, in read_text
return _read_text(f, delimiter, first_column_names, dtype)
File "/usr/lib/python3.6/site-packages/anndata/readwrite/read.py", line 243, in _read_text
.format(delimiter))
ValueError: Did not find delimiter " " in first line.
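Judging by the error, read_csv looks for the delimiter in the first line, which a one-column file never contains. Since the goal here is an annotation table rather than a data matrix, reading the file with pandas directly may be simpler. A sketch, with an in-memory string standing in for barcodes.tsv and a made-up column name `barcode`:

```python
import io
import pandas as pd

# Stand-in for the contents of barcodes.tsv: one barcode per line, no header.
barcodes_file = io.StringIO("AAACCTGAGAACAACT-1\nAAACCTGAGCTAGTTC-1\n")

barcodes = pd.read_csv(barcodes_file, sep="\t", header=None, names=["barcode"])
obs = barcodes.set_index("barcode")
print(obs.index.tolist())
```

The result is an empty DataFrame indexed by barcode, which is the usual shape of an obs table without further annotations.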
Hi guys,
Is there any way to store a networkx object in AnnData?
I tried to store it in adata.uns['networkx'] and saved it as an .h5ad-formatted file with adata.write(results_file).
But when I read the file back in, the networkx object can't be restored. It has been transformed into an array. I wonder what the right way is to deal with networkx or other graph objects in AnnData.
Any suggestions would be much appreciated. Thanks!
I'd like to store some design matrix inside adata.obsm.
However, there are cases where this design matrix has only one column, i.e. it has shape (adata.n_obs, 1).
In this case I get the following error message:
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-28-86462c4012f3>", line 1, in <module>
adata.obsm["asdf"] = np.reshape(np.arange(1000), (1000,1))
File "/usr/lib/python3.6/site-packages/anndata/base.py", line 126, in __setitem__
new[name] = arr
ValueError: could not broadcast input array from shape (1000,1) into shape (1000)
Example to reproduce:
adata.obsm["asdf"] = np.reshape(np.arange(adata.n_obs), (adata.n_obs,1))
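The traceback points at an assignment into a named field (`new[name] = arr`), which suggests a structured array whose field has shape (n_obs,). Assuming that is the cause, the behaviour and two possible workarounds can be reproduced with NumPy alone; this is a sketch, not anndata's internal code.

```python
import numpy as np

n_obs = 5
design = np.arange(n_obs).reshape(n_obs, 1)

# A scalar field has shape (n_obs,), so a (n_obs, 1) array cannot be assigned to it.
flat = np.empty(n_obs, dtype=[("asdf", "i8")])
flat["asdf"] = design.ravel()          # workaround 1: drop the trailing axis

# A sub-array field of shape (1,) keeps the column two-dimensional.
wide = np.empty(n_obs, dtype=[("asdf", "i8", (1,))])
wide["asdf"] = design                  # workaround 2: declare the width
print(flat["asdf"].shape, wide["asdf"].shape)
```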
Are there any scRNA-seq datasets in the AnnData format that are publicly available?
Thanks!
Hi there,
Sorry for the simple question, but I just started using anndata within Scanpy and was wondering: is there a way to remove a specific row? Something like
if adata.var_names == "foo": remove the row
(It is to remove some mitochondrial genes)
Thank you!
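One common approach is to build a boolean mask over the gene names and keep everything that does not match. The sketch below uses a pandas Index and a NumPy array as hypothetical stand-ins for `adata.var_names` and `adata.X`; the "mt-" prefix is an assumption about how the mitochondrial genes are named. On an AnnData object the same mask would be applied as `adata[:, keep]`, like the `isin` selection shown earlier on this page.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for adata.var_names and adata.X (cells x genes).
var_names = pd.Index(["Xkr4", "mt-Nd1", "Tcea1", "mt-Co1"])
x = np.arange(8).reshape(2, 4)

# True for every gene to keep; here, drop names starting with "mt-".
keep = ~np.asarray(var_names.str.startswith("mt-"), dtype=bool)

kept_names = var_names[keep]
kept_x = x[:, keep]
print(list(kept_names))
```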
I'm not sure if this belongs here or in the scanpy repo, and this is a hybrid bug and feature request. This is also somewhat related to #31, as both issues stem from the same function.
If two sample annotations have the same "value" but different dtype, a dataset saved as h5ad becomes unreadable. This stems from the way that the categories for 'object' typed annotations are defined. Minimal example:
import scanpy.api as sc
# load any dataset:
dataset = sc.read('/path/to/dataset')
test1 = dataset.copy()[:5, :5]
test2 = dataset.copy()[:5, :5]
# add annotations
test1.obs['sampleid'] = 1
test2.obs['sampleid'] = '1'
test_combined = test1.concatenate([test2])
test_combined.save('test.h5ad')
test_combined = sc.read('test.h5ad')
The last line fails for me with:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-85-2cfbeaca8a55> in <module>()
----> 1 test_combined = sc.read_h5ad('test.h5ad')
/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/anndata/readwrite/read.py in read_h5ad(filename, backed)
343 # load everything into memory
344 d = _read_h5ad(filename=filename)
--> 345 return AnnData(d)
346
347
/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
634 obsm=obsm, varm=varm, raw=raw,
635 dtype=dtype, shape=shape,
--> 636 filename=filename, filemode=filemode)
637
638 def _init_as_view(self, adata_ref, oidx, vidx):
/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/anndata/base.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, raw, dtype, shape, filename, filemode)
753 raise ValueError(
754 'If `X` is a dict no further arguments must be provided.')
--> 755 X, obs, var, uns, obsm, varm, raw = self._from_dict(X)
756
757 # init from AnnData
/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/anndata/base.py in _from_dict(ddata)
1872 d_true_keys['obs'][k_stripped] = pd.Categorical.from_codes(
1873 codes=d_true_keys['obs'][k_stripped].values,
-> 1874 categories=v)
1875 if k_stripped in d_true_keys['var']:
1876 d_true_keys['var'][k_stripped] = pd.Categorical.from_codes(
/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/pandas/core/categorical.py in from_codes(cls, codes, categories, ordered)
616 "codes need to be convertible to an arrays of integers")
617
--> 618 categories = CategoricalDtype._validate_categories(categories)
619
620 if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):
/projects/flynnb/software/anaconda/envs/scc/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py in _validate_categories(categories, fastpath)
325
326 if not categories.is_unique:
--> 327 raise ValueError('Categorical categories must be unique')
328
329 if isinstance(categories, ABCCategoricalIndex):
ValueError: Categorical categories must be unique
Looking at the attributes of test_combined, I see this:
test_combined.obs.sampleid.astype('category')
AAACCTGAGAACAACT-1-0 1
AAACCTGAGCTAGTTC-1-0 1
AAACCTGAGGGAAACA-1-0 1
AAACCTGCAATCACAC-1-0 1
AAACCTGCAATCGAAA-1-0 1
AAACCTGAGAACAACT-1-1 1
AAACCTGAGCTAGTTC-1-1 1
AAACCTGAGGGAAACA-1-1 1
AAACCTGCAATCACAC-1-1 1
AAACCTGCAATCGAAA-1-1 1
Name: sampleid, dtype: category
Categories (2, object): [1, 1]
and inspecting h5file['uns']['sampleid_categories'] yields [b'1', b'1']. Because some of the values are strings, the dtype of the column in the dataframe gets set as 'object', which causes is_string_dtype(data.obs.sampleid) to be True.
I think the logic in base.df_to_records_fixed_width should probably be changed to sanitize user-defined inputs like this or display a warning if mixed dtypes are detected.
I'm using anndata==0.6.4 and scanpy==1.2.1.
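The duplicate categories can be reproduced with pandas alone, and casting the column to a single dtype before it becomes categorical avoids them. This is a sketch of that kind of sanitizing, not a patch to df_to_records_fixed_width.

```python
import pandas as pd

# Same "value" with two dtypes, as after concatenating test1 and test2.
sampleid = pd.Series([1, 1, "1", "1"], dtype=object)

mixed = sampleid.astype("category")
print(list(mixed.cat.categories))   # two categories: the int and the str

# Casting to str first collapses them into a single category.
clean = sampleid.astype(str).astype("category")
print(list(clean.cat.categories))
```

Once both values are written out as bytes, the two categories of `mixed` become indistinguishable, which matches the `[b'1', b'1']` seen in the file.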
Hi guys,
I'm really enjoying the efficiency and scalability of anndata. I have been using it to manage my large datasets. Excellent work!
Unfortunately, I'm having trouble writing the adata object to an .h5ad-formatted HDF5 file.
adata.write(results_file)
It takes a very long time and ends with one of the errors below (I haven't managed to store the adata object in a file).
In the adata object, I'm using different annotations including uns, obs, var and obsm, and it works smoothly without any errors. But each time I try to write it to a file, the aforementioned errors are thrown.
I would really appreciate your opinions. Thanks in advance!
When trying to create an AnnData from a pandas DataFrame I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-235904346c66> in <module>()
----> 1 adata_prova = sc.AnnData(adata_to_df(adata=adata_p))
~/.pyenv/versions/3.6.4/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
681 layers=layers,
682 dtype=dtype, shape=shape,
--> 683 filename=filename, filemode=filemode)
684
685 def _init_as_view(self, adata_ref: 'AnnData', oidx: Index, vidx: Index):
~/.pyenv/versions/3.6.4/lib/python3.6/site-packages/anndata/base.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, raw, layers, dtype, shape, filename, filemode)
824 class_names = ', '.join(c.__name__ for c in StorageType.classes())
825 raise ValueError('`X` needs to be of one of {}, not {}.'
--> 826 .format(class_names, type(X)))
827 if shape is not None:
828 raise ValueError('`shape` needs to be `None` is `X` is not `None`.')
ValueError: `X` needs to be of one of ndarray, MaskedArray, spmatrix, ZarrArray, not <class 'pandas.core.frame.DataFrame'>.
The line in the current version is:
I am using the latest (from git) version of scanpy (1.3.1+68.ga045533) and the latest published version of AnnData (0.6.10).
I am creating an AnnData using:
adata = sc.AnnData(df)
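The error lists the accepted types for X, and a DataFrame is not among them in this version. A possible workaround, sketched below with a made-up 3 × 2 frame, is to split the DataFrame into an ndarray plus two index-only frames and pass those on instead. The final constructor call is left as a comment because it has not been checked against 0.6.10.

```python
import pandas as pd

df = pd.DataFrame(
    [[1.0, 0.0], [0.0, 3.0], [2.0, 2.0]],
    index=["cell_a", "cell_b", "cell_c"],
    columns=["Gata1", "Gata2"],
)

X = df.to_numpy()                      # plain ndarray, one of the accepted types
obs = pd.DataFrame(index=df.index)     # row labels become observation names
var = pd.DataFrame(index=df.columns)   # column labels become variable names

# These three pieces would then be passed on, e.g. AnnData(X, obs=obs, var=var).
print(X.shape, list(obs.index), list(var.index))
```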
Hi, would it be possible to change the numpy/scipy dependencies to numpy >= 1.14 and scipy >= 1.0?
anndata otherwise causes problems with other packages requiring newer versions of numpy/scipy.
Hey. I've run into a couple issues with reading in backed objects with a raw representation.
The first is just the case of reading in an object with a raw attribute:
import scanpy.api as sc
import numpy as np
adata = sc.AnnData(X=np.random.binomial(100, .01, (100, 100)))
adata.raw = adata.copy()
sc.pp.log1p(adata) # Just so they are different
adata.write("./tmp.h5ad")
sc.read("tmp.h5ad", backed="r")
AttributeError Traceback (most recent call last)
<ipython-input-2-7e0cdbc773a6> in <module>()
3 sc.pp.log1p(adata) # Just so they are different
4 adata.write("./tmp.h5ad")
----> 5 sc.read("tmp.h5ad", backed="r")
/usr/local/lib/python3.6/site-packages/scanpy/readwrite.py in read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, **kwargs)
73 return _read(filename, backed=backed, sheet=sheet, ext=ext,
74 delimiter=delimiter, first_column_names=first_column_names,
---> 75 backup_url=backup_url, cache=cache, **kwargs)
76 # generate filename and read to dict
77 filekey = filename
/usr/local/lib/python3.6/site-packages/scanpy/readwrite.py in _read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, suppress_cache_warning, **kwargs)
274 if ext in {'h5', 'h5ad'}:
275 if sheet is None:
--> 276 return read_h5ad(filename, backed=backed)
277 else:
278 logg.msg('reading sheet', sheet, 'from file', filename, v=4)
/usr/local/lib/python3.6/site-packages/anndata/readwrite/read.py in read_h5ad(filename, backed)
407 if backed:
408 # open in backed-mode
--> 409 return AnnData(filename=filename, filemode=backed)
410 else:
411 # load everything into memory
/usr/local/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
681 layers=layers,
682 dtype=dtype, shape=shape,
--> 683 filename=filename, filemode=filemode)
684
685 def _init_as_view(self, adata_ref: 'AnnData', oidx: Index, vidx: Index):
/usr/local/lib/python3.6/site-packages/anndata/base.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, raw, layers, dtype, shape, filename, filemode)
895 self,
896 X=raw['X'],
--> 897 var=_gen_dataframe(raw['var'], raw['X'].shape[1], ['var_names', 'col_names']),
898 varm=raw['varm'] if 'varm' in raw else None)
899
AttributeError: 'NoneType' object has no attribute 'shape'
Additionally the reader doesn't clean up after itself if it errors. In the same session:
adata.write("./tmp.h5ad")
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-3-e04b48216e1b> in <module>()
----> 1 adata.write("./tmp.h5ad")
/usr/local/lib/python3.6/site-packages/anndata/base.py in write(self, filename, compression, compression_opts, force_dense)
1887
1888 _write_h5ad(filename, self, compression=compression, compression_opts=compression_opts,
-> 1889 force_dense=force_dense)
1890 if self.isbacked:
1891 self.file.close()
/usr/local/lib/python3.6/site-packages/anndata/readwrite/write.py in _write_h5ad(filename, adata, force_dense, **kwargs)
218 d['X'] = adata.X[:]
219 # need to use 'a' if backed, otherwise we loose the backed objects
--> 220 with h5py.File(filename, 'a' if adata.isbacked else 'w', force_dense=force_dense) as f:
221 for key, value in d.items():
222 _write_key_value_to_h5(f, key, value, **kwargs)
/usr/local/lib/python3.6/site-packages/anndata/h5py/h5sparse.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, force_dense, **kwds)
139 userblock_size=userblock_size,
140 swmr=swmr,
--> 141 **kwds,
142 )
143 super().__init__(self.h5f, force_dense)
/usr/local/lib/python3.6/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, **kwds)
310 with phil:
311 fapl = make_fapl(driver, libver, **kwds)
--> 312 fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
313
314 if swmr_support:
/usr/local/lib/python3.6/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
146 fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl, fcpl=fcpl)
147 elif mode == 'w':
--> 148 fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
149 elif mode == 'a':
150 # Open in append mode (read/write).
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.create()
OSError: Unable to create file (unable to truncate a file which is already open)
I get these errors using v0.6.10 and the current master branch.
Hi @flying-sheep,
the last modifications to the docs seem to have destroyed proper linking here, both in the Attributes and the Methods.
What do you think?
When slicing AnnData with a list of one element (adata[:, [0]]), I expect it to return a 2d array.
When slicing AnnData with one element (adata[:, 0]), I expect it to return a 1d array.
AnnData always returns a 1d array.
See the following example
a = np.ones((3, 3))
adata = AnnData(a)
Expected behaviour (like in numpy)
> a[:, [0]]
array([[1.],
[1.],
[1.]])
> a[:, 0]
array([1., 1., 1.])
> adata.X[:, [0]]
array([[1.],
[1.],
[1.]], dtype=float32)
Actual behaviour:
> adata[:, 0].X
ArrayView([1., 1., 1.], dtype=float32)
> adata[:, [0]].X
This is somewhat related to #60. I still opened a separate issue as in my case the behaviour is actually inconsistent with numpy.
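The NumPy rule referred to here is that an integer index drops the axis while a list index keeps it. A short sketch on a plain array, including one way to restore the second axis when only a 1d result is available:

```python
import numpy as np

a = np.ones((3, 3))

col_scalar = a[:, 0]     # integer index drops the axis
col_list = a[:, [0]]     # list index keeps the axis
print(col_scalar.shape, col_list.shape)

# Restoring the second axis on a 1-D result, if a 2-D column is needed.
restored = col_scalar[:, np.newaxis]
```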
When the uns field is used to save a dictionary with a very large number of keys/values, the file size increases massively. As expected, reading and writing then become slow.
When I subset 3 times on this dataset, AnnData throws an AttributeError: 'AnnData' object has no attribute 'file'. Interestingly, the first two subsets work as expected.
import scanpy.api as sc
import pandas as pd
import numpy as np
adata = sc.read_h5ad("adata.h5ad")
adata = adata[adata.obs['n_genes'] > 200, :]
adata = adata[adata.obs['n_genes'] > 200, :]
adata = adata[adata.obs['n_genes'] > 200, :]
(it doesn't matter if I subset with different values or on different columns)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-7835ead5dcea> in <module>
----> 1 adata = adata[adata.obs['n_genes'] > 200, :]
~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in __getitem__(self, index)
1292 def __getitem__(self, index):
1293 """Returns a sliced view of the object."""
-> 1294 return self._getitem_view(index)
1295
1296 def _getitem_view(self, index):
~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in _getitem_view(self, index)
1296 def _getitem_view(self, index):
1297 oidx, vidx = self._normalize_indices(index)
-> 1298 return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
1299
1300 # this is used in the setter for uns, if a view
~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
674 if not isinstance(X, AnnData):
675 raise ValueError('`X` has to be an AnnData object.')
--> 676 self._init_as_view(X, oidx, vidx)
677 else:
678 self._init_as_actual(
~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in _init_as_view(self, adata_ref, oidx, vidx)
705 self._varm = ArrayView(adata_ref.varm[vidx_normalized], view_args=(self, 'varm'))
706 # hackish solution here, no copy should be necessary
--> 707 uns_new = deepcopy(self._adata_ref._uns)
708 # need to do the slicing before setting the updated self._n_obs, self._n_vars
709 self._n_obs = self._adata_ref.n_obs # use the original n_obs here
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
178 y = x
179 else:
--> 180 y = _reconstruct(x, memo, *rv)
181
182 # If is its own copy, don't memoize.
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
278 if state is not None:
279 if deep:
--> 280 state = deepcopy(state, memo)
281 if hasattr(y, '__setstate__'):
282 y.__setstate__(state)
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
148 copier = _deepcopy_dispatch.get(cls)
149 if copier:
--> 150 y = copier(x, memo)
151 else:
152 try:
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _deepcopy_dict(x, memo, deepcopy)
238 memo[id(x)] = y
239 for key, value in x.items():
--> 240 y[deepcopy(key, memo)] = deepcopy(value, memo)
241 return y
242 d[dict] = _deepcopy_dict
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
148 copier = _deepcopy_dispatch.get(cls)
149 if copier:
--> 150 y = copier(x, memo)
151 else:
152 try:
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _deepcopy_tuple(x, memo, deepcopy)
218
219 def _deepcopy_tuple(x, memo, deepcopy=deepcopy):
--> 220 y = [deepcopy(a, memo) for a in x]
221 # We're not going to put the tuple in the memo, but it's still important we
222 # check for it, in case the tuple contains recursive mutable structures.
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in <listcomp>(.0)
218
219 def _deepcopy_tuple(x, memo, deepcopy=deepcopy):
--> 220 y = [deepcopy(a, memo) for a in x]
221 # We're not going to put the tuple in the memo, but it's still important we
222 # check for it, in case the tuple contains recursive mutable structures.
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
178 y = x
179 else:
--> 180 y = _reconstruct(x, memo, *rv)
181
182 # If is its own copy, don't memoize.
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
278 if state is not None:
279 if deep:
--> 280 state = deepcopy(state, memo)
281 if hasattr(y, '__setstate__'):
282 y.__setstate__(state)
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
148 copier = _deepcopy_dispatch.get(cls)
149 if copier:
--> 150 y = copier(x, memo)
151 else:
152 try:
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _deepcopy_dict(x, memo, deepcopy)
238 memo[id(x)] = y
239 for key, value in x.items():
--> 240 y[deepcopy(key, memo)] = deepcopy(value, memo)
241 return y
242 d[dict] = _deepcopy_dict
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
178 y = x
179 else:
--> 180 y = _reconstruct(x, memo, *rv)
181
182 # If is its own copy, don't memoize.
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
278 if state is not None:
279 if deep:
--> 280 state = deepcopy(state, memo)
281 if hasattr(y, '__setstate__'):
282 y.__setstate__(state)
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
148 copier = _deepcopy_dispatch.get(cls)
149 if copier:
--> 150 y = copier(x, memo)
151 else:
152 try:
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _deepcopy_dict(x, memo, deepcopy)
238 memo[id(x)] = y
239 for key, value in x.items():
--> 240 y[deepcopy(key, memo)] = deepcopy(value, memo)
241 return y
242 d[dict] = _deepcopy_dict
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in deepcopy(x, memo, _nil)
178 y = x
179 else:
--> 180 y = _reconstruct(x, memo, *rv)
181
182 # If is its own copy, don't memoize.
~/.conda/envs/single_cell_integration/lib/python3.6/copy.py in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
305 key = deepcopy(key, memo)
306 value = deepcopy(value, memo)
--> 307 y[key] = value
308 else:
309 for key, value in dictiter:
~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in __setitem__(self, idx, value)
455 else:
456 adata_view, attr_name = self._view_args
--> 457 _init_actual_AnnData(adata_view)
458 getattr(adata_view, attr_name)[idx] = value
459
~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in _init_actual_AnnData(adata_view)
373
374 def _init_actual_AnnData(adata_view):
--> 375 if adata_view.isbacked:
376 raise ValueError(
377 'You cannot modify elements of an AnnData view, '
~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in isbacked(self)
1188 def isbacked(self):
1189 """``True`` if object is backed on disk, ``False`` otherwise."""
-> 1190 return self.filename is not None
1191
1192 @property
~/.conda/envs/single_cell_integration/lib/python3.6/site-packages/anndata/base.py in filename(self)
1204 want to copy the previous file, use ``copy(filename='new_filename')``.
1205 """
-> 1206 return self.file.filename
1207
1208 @filename.setter
AttributeError: 'AnnData' object has no attribute 'file'
Running Scanpy 1.3.2 on 2018-10-29 14:22.
anndata==0.6.10 numpy==1.14.3 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.20.0 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1 matplotlib==3.0.0 seaborn==0.9.0
Hi all,
When loading data that contains duplicates, I usually keep the item with the highest median (e.g. the gene with the highest median signal). With a pandas DataFrame it can be done as easily as this:
df = df.T
df['Median'] = df.median(axis=1)
df = df.sort_values(by=['Median'], ascending=False, na_position='last')
df = df.drop(columns=['Median'])
df = df.groupby(level=1).first()
df = df.T
This assumes genes are in the columns and samples are in the index. I find it more useful than:
Line 10 in 1c05290
since the gene names are left unchanged. However, I don't know how to implement this with AnnData. Would you be interested in integrating it into AnnData, perhaps with additional options (e.g. averaging)?
Thanks,
Francesco
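A minimal sketch of the same selection expressed as a boolean column mask, assuming a dense matrix with one column per gene. The helper name `highest_median_mask` is hypothetical and not part of anndata; the resulting mask could presumably be applied as `adata[:, mask]`.

```python
import numpy as np
import pandas as pd


def highest_median_mask(X, var_names):
    """Boolean mask keeping, per duplicated name, the column with the highest median.

    Hypothetical helper for illustration; not part of anndata.
    """
    medians = pd.Series(np.median(np.asarray(X), axis=0))
    names = pd.Series(list(var_names))
    # For each name, idxmax gives the position of its highest-median column.
    keep = medians.groupby(names.to_numpy()).idxmax().to_numpy()
    mask = np.zeros(len(names), dtype=bool)
    mask[keep] = True
    return mask


# Rows are samples, columns are genes; "Tcea1" appears twice.
X = np.array([[1.0, 5.0, 2.0],
              [3.0, 7.0, 4.0],
              [2.0, 6.0, 9.0]])
var_names = ["Tcea1", "Tcea1", "Xkr4"]
mask = highest_median_mask(X, var_names)
```

Unlike renaming duplicates, this drops the lower-median columns and leaves the surviving gene names untouched.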
AnnData doesn't seem to like h5ad files with a single observation:
In [54]: anndata.__version__
Out[54]: '0.6.6'
In [55]: df.shape
Out[55]: (23465, 1)
In [56]: adata = anndata.AnnData(df.values.T, {"cell_names": df.columns.values}, {"gene_names": df.index.values})
In [57]: adata.write("test.h5ad")
In [58]: bdata = anndata.read_h5ad("test.h5ad")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-58-222d2ae8c407> in <module>()
----> 1 bdata = anndata.read_h5ad("test.h5ad")
~/.local/lib/python3.5/site-packages/anndata/readwrite/read.py in read_h5ad(filename, backed)
342 # load everything into memory
343 d = _read_h5ad(filename=filename)
--> 344 return AnnData(d)
345
346
~/.local/lib/python3.5/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
639 obsm=obsm, varm=varm, raw=raw,
640 dtype=dtype, shape=shape,
--> 641 filename=filename, filemode=filemode)
642
643 def _init_as_view(self, adata_ref, oidx, vidx):
~/.local/lib/python3.5/site-packages/anndata/base.py in _init_as_actual(self, X, obs, var, uns, obsm, varm, raw, dtype, shape, filename, filemode)
758 raise ValueError(
759 'If `X` is a dict no further arguments must be provided.')
--> 760 X, obs, var, uns, obsm, varm, raw = self._from_dict(X)
761
762 # init from AnnData
~/.local/lib/python3.5/site-packages/anndata/base.py in _from_dict(ddata)
1920 if key in d_true_keys[true_key].dtype.names:
1921 d_true_keys[true_key] = pd.DataFrame.from_records(
-> 1922 d_true_keys[true_key], index=key)
1923 break
1924 d_true_keys[true_key].index = d_true_keys[true_key].index.astype('U')
~/.local/lib/python3.5/site-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1267 else:
1268 arrays, arr_columns = _to_arrays(data, columns,
-> 1269 coerce_float=coerce_float)
1270
1271 arr_columns = _ensure_index(arr_columns)
~/.local/lib/python3.5/site-packages/pandas/core/frame.py in _to_arrays(data, columns, coerce_float, dtype)
7493 else:
7494 # last ditch effort
-> 7495 data = lmap(tuple, data)
7496 return _list_to_arrays(data, columns, coerce_float=coerce_float,
7497 dtype=dtype)
~/.local/lib/python3.5/site-packages/pandas/compat/__init__.py in lmap(*args, **kwargs)
129
130 def lmap(*args, **kwargs):
--> 131 return list(map(*args, **kwargs))
132
133 def lfilter(*args, **kwargs):
TypeError: 'numpy.int64' object is not iterable
But faking another cell works fine:
In [59]: df2 = pandas.concat([df, df], axis=1)
In [60]: df2.shape
Out[60]: (23465, 2)
In [61]: adata = anndata.AnnData(df2.values.T, {"cell_names": df2.columns.values}, {"gene_names": df2.index.values})
In [62]: adata.write("test.h5ad")
In [63]: bdata = anndata.read_h5ad("test.h5ad")
In [64]: bdata.n_obs, bdata.n_vars
Out[64]: (2, 23465)
I'm guessing this has something to do with _fix_shapes
Hi, scvi supports automatically loading an .h5ad file to analyze scRNA-seq. Since we included anndata as one of our dependencies and we uploaded scvi to the bioconda channel, I'm wondering if it's okay if I upload a conda recipe for anndata to bioconda?
Test case:
import scanpy.api as sc
import numpy as np
adata = sc.AnnData(X=np.random.binomial(100, .01, (100, 100)))
adata.obs_names = adata.obs_names.astype(str)
# this works fine
adata[0:2,:][:,0:2]
adata.write("tmp.h5ad")
adata_backed = sc.read("tmp.h5ad", backed="r")
# this throws error
adata_backed[0:2,:][:,0:2]
Traceback of final line:
Traceback (most recent call last):
File "h5py/_objects.pyx", line 200, in h5py._objects.ObjectID.__dealloc__
KeyError: 0
Exception ignored in: 'h5py._objects.ObjectID.__dealloc__'
Traceback (most recent call last):
File "h5py/_objects.pyx", line 200, in h5py._objects.ObjectID.__dealloc__
KeyError: 0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 1297, in __getitem__
return self._getitem_view(index)
File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 1301, in _getitem_view
return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 664, in __init__
self._init_as_view(X, oidx, vidx)
File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 689, in _init_as_view
uns_new = deepcopy(self._adata_ref._uns)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/usr/lib/python3.6/copy.py", line 220, in <listcomp>
y = [deepcopy(a, memo) for a in x]
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/usr/lib/python3.6/copy.py", line 274, in _reconstruct
y = func(*args)
File "stringsource", line 5, in h5py.h5f.__pyx_unpickle_FileID
File "h5py/_objects.pyx", line 178, in h5py._objects.ObjectID.__cinit__
TypeError: __cinit__() takes exactly 1 positional argument (0 given)
The faster indexing solution uses var_names as an index to speed up lookups. If the var_names happen to be strings, that works (slower than necessary), but if they're integers, this breaks. Example:
In[1]: ad = AnnData(np.array([[0,1,2],[3,4,5]]), var=dict(var_names=[10,11,12]))
In[2]: ad[:, ad.X.sum(0) > 3]
Traceback (most recent call last):
File "<ipython-input-23-d08541977b75>", line 1, in <module>
ad[:, ad.X.sum(0) > 3]
File "anndata/base.py", line 1187, in __getitem__
return self._getitem_view(index)
File "anndata/base.py", line 1190, in _getitem_view
oidx, vidx = self._normalize_indices(index)
File "anndata/base.py", line 1167, in _normalize_indices
var = _normalize_index(var, self.var_names)
File "anndata/base.py", line 244, in _normalize_index
positions = positions[index]
File "pandas/core/series.py", line 809, in __getitem__
return self._get_with(key)
[...]
File "pandas/core/indexing.py", line 1206, in _validate_read_indexer
key=key, axis=self.obj._get_axis_name(axis)))
KeyError: 'None of [[1 2]] are in the [index]'
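A rough sketch of how an index could be normalized so that boolean masks never get looked up as labels. This is an illustration under assumptions, not anndata's `_normalize_index`: the helper name `normalize_index` is hypothetical, integers are treated as positions, and only strings are treated as labels.

```python
import numpy as np
import pandas as pd


def normalize_index(index, names):
    """Translate a boolean mask, integer positions, or string labels into positions.

    Hypothetical helper for illustration; `names` is a pandas Index.
    """
    index = np.atleast_1d(np.asarray(index))
    if index.dtype == bool:
        # Masks select by position, whatever the type of the names.
        if index.shape != (len(names),):
            raise IndexError("boolean mask length does not match the axis")
        return np.flatnonzero(index)
    if np.issubdtype(index.dtype, np.integer):
        # Integers are taken as positions, never as labels.
        return index
    # Anything else is looked up as a label in the names.
    positions = names.get_indexer(index)
    if (positions == -1).any():
        raise KeyError(f"names not found: {list(index[positions == -1])}")
    return positions


X = np.array([[0, 1, 2], [3, 4, 5]])
int_names = pd.Index([10, 11, 12])
from_mask = normalize_index(X.sum(0) > 3, int_names)

str_names = pd.Index(["a", "b", "c"])
from_labels = normalize_index(["c", "a"], str_names)
```

Handling the mask first means integer var_names cannot be confused with the positions produced by the mask.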
This here:
would e.g. be nicer as
def concatenate(self, *adatas, batch_key='batch', batch_categories=None): ...
This is because we can still call it with a list that way, but also more easily with multiple adatas. And we don’t have to say “list-or-AnnData”, but only “AnnData(s)”:
adata.concatenate(adata2)
adata.concatenate(adata2, adata3)
adata.concatenate(*some_adatas)
Also generally, when we introduce an API with more than two parameters, one of which has a default, we should do
def foo(bar, *, baz=1, boz=2): ...
or
def foo(bar, *baz, boz=1, biz=2): ...
(Every parameter after a splat star is keyword-only, which prevents arguments from being passed in the wrong position.)
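To make the proposal concrete, here is a small sketch using a toy class. `Dataset` is a hypothetical stand-in for AnnData and its `concatenate` only records names per batch; it shows the call styles and the keyword-only behaviour, nothing more.

```python
class Dataset:
    """Toy stand-in for AnnData, used only to show the call signature."""

    def __init__(self, name):
        self.name = name

    def concatenate(self, *others, batch_key="batch", batch_categories=None):
        members = [self, *others]
        if batch_categories is None:
            batch_categories = [str(i) for i in range(len(members))]
        if len(batch_categories) != len(members):
            raise ValueError("need one batch category per dataset")
        names = [m.name for m in members]
        return {batch_key: dict(zip(batch_categories, names))}


def foo(bar, *, baz=1, boz=2):
    """Everything after the bare star must be passed by keyword."""
    return bar, baz, boz


a, b, c = Dataset("a"), Dataset("b"), Dataset("c")
two = a.concatenate(b)
three = a.concatenate(*[b, c], batch_key="sample")

positional_error = None
try:
    foo(0, 5)  # 5 cannot silently land in `baz`
except TypeError as err:
    positional_error = str(err)
```

A list of objects is still accepted by unpacking it with `*`, which is why the signature does not need a "list-or-AnnData" parameter.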
Hi,
I use scanpy 1.3.1. I have tried the 'read_loom' function, but it produced the following error:
adata = sc.read(filename=path_to_velocyto_files + 'all_controls.loom')
--> This might be very slow. Consider passing `cache=True`, which enables much faster reading from a cache file.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-52-61a36c9ba297> in <module>()
----> 1 adata = sc.read(filename=path_to_velocyto_files + 'all_controls.loom')
~/anaconda3/lib/python3.6/site-packages/scanpy/readwrite.py in read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, **kwargs)
73 return _read(filename, backed=backed, sheet=sheet, ext=ext,
74 delimiter=delimiter, first_column_names=first_column_names,
---> 75 backup_url=backup_url, cache=cache, **kwargs)
76 # generate filename and read to dict
77 filekey = filename
~/anaconda3/lib/python3.6/site-packages/scanpy/readwrite.py in _read(filename, backed, sheet, ext, delimiter, first_column_names, backup_url, cache, suppress_cache_warning, **kwargs)
311 adata = _read_softgz(filename)
312 elif ext == 'loom':
--> 313 adata = read_loom(filename=filename, **kwargs)
314 else:
315 raise ValueError('Unkown extension {}.'.format(ext))
~/anaconda3/lib/python3.6/site-packages/anndata/readwrite/read.py in read_loom(filename, sparse, cleanup, X_name, obs_names, var_names)
144 filename = fspath(filename) # allow passing pathlib.Path objects
145 from loompy import connect
--> 146 with connect(filename, 'r') as lc:
147
148 if X_name not in lc.layers.keys(): X_name = ''
AttributeError: __enter__
When I read the file in the interactive console, I get the following result:
> filename =os.fspath(path_to_velocyto_files + 'all_controls.loom')
> lc =connect(filename, 'r')
> lc.layer.keys()
dict_keys(['', 'ambiguous', 'spliced', 'unspliced'])
So for my loom file, it's not 'layers', but 'layer'. Could you consider handling this case in the anndata read function?
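The `AttributeError: __enter__` in the traceback means the object returned by `connect` does not implement the context manager protocol, so it cannot be used in a `with` statement. A defensive reader could look roughly like the sketch below. `OldStyleConnection` is a hypothetical stand-in, not the loompy API, and `layer_keys` is an illustrative helper.

```python
class OldStyleConnection:
    """Toy connection without `__enter__`/`__exit__` that exposes `layer`
    rather than `layers` (hypothetical, for illustration only)."""

    def __init__(self):
        self.layer = {"": "X", "spliced": "S", "unspliced": "U"}
        self.closed = False

    def close(self):
        self.closed = True


def layer_keys(connection):
    """Return the layer names, trying `layers` first and `layer` second."""
    try:
        layers = getattr(connection, "layers", None)
        if layers is None:
            layers = connection.layer
        return sorted(layers.keys())
    finally:
        # try/finally closes the connection even without context manager support.
        connection.close()


conn = OldStyleConnection()
keys = layer_keys(conn)
```

Whether the attribute name differs because of the installed loompy version is an assumption; upgrading loompy may be the simpler fix.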
This issue is meant to serve as a discussion page for establishing conventions for storing sparse data in HDF5 files.
The suggestion made within anndata is described here.
BoundRecArray objects don't keep their attributes when pickled:
>>> from anndata import AnnData
/usr/local/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
>>> import pickle
>>> import numpy as np
>>>
>>> adata = AnnData()
>>> adata.obsm._parent == adata
True
>>> adata2 = pickle.loads(pickle.dumps(adata))
>>> adata2.obsm._parent == adata2
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/numpy/core/records.py", line 450, in __getattribute__
res = fielddict[attr][:2]
KeyError: '_parent'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/site-packages/numpy/core/records.py", line 452, in __getattribute__
raise AttributeError("recarray has no attribute %s" % attr)
AttributeError: recarray has no attribute _parent
>>> adata2.obsm.__dict__
{}
>>> adata.obsm.__dict__
{'_parent': AnnData object with n_obs × n_vars = 0 × 0 , '_attr': 'obsm'}
Based on this stackoverflow question, I think the issue comes from subclassing a numpy object, which has custom code for pickling.
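A commonly used pattern for this, sketched here on a plain `ndarray` subclass rather than on `BoundRecArray`, is to extend `__reduce__` and `__setstate__` so the extra attribute travels with the array's own state. `TaggedArray` and its `attr` field are hypothetical names for illustration.

```python
import pickle

import numpy as np


class TaggedArray(np.ndarray):
    """ndarray subclass carrying one extra attribute (illustrative only)."""

    def __new__(cls, input_array, attr=None):
        obj = np.asarray(input_array).view(cls)
        obj.attr = attr
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.attr = getattr(obj, "attr", None)

    def __reduce__(self):
        # Append the extra attribute to the state tuple numpy would pickle.
        func, args, state = super().__reduce__()
        return func, args, state + (self.attr,)

    def __setstate__(self, state):
        # Peel the extra attribute off again before numpy restores the rest.
        self.attr = state[-1]
        super().__setstate__(state[:-1])


arr = TaggedArray(np.arange(3), attr="obsm")
restored = pickle.loads(pickle.dumps(arr))
```

A back-reference such as `_parent` is a harder case than a string, since pickling it would also pickle the parent object; this sketch does not address that.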
For example, "column_annotations" instead of "smp"
Hey,
Currently, CSV/TSV files with gzip or bzip2 compression are not supported, if I'm not mistaken. There is an issue in DCA (theislab/dca#7) about this, so I wanted to file a tracking issue here.
I was thinking about simply adding gzip.open() and bz2.open() calls based on the file extension, just like other functions in the implementation. Another option is to use pandas, because once we add compression support to anndata, read_text() will start to converge toward pandas.read_csv().
So would it make sense to use pandas.read_csv for all text file reading functionality? It's already a dependency of anndata, so I don't see why not.
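The extension-based option could look roughly like this, using only the standard library (`gzip.open` and `bz2.open` both accept text mode). The helper name `open_text` is hypothetical and not part of anndata.

```python
import bz2
import gzip
import os
import tempfile


def open_text(path):
    """Open a plain, gzip- or bzip2-compressed text file based on its extension."""
    path = os.fspath(path)
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "rt")


# Round trip through a temporary gzip-compressed TSV file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "counts.tsv.gz")
    with gzip.open(path, "wt") as f:
        f.write("gene\tcell1\nTcea1\t3\n")
    with open_text(path) as f:
        rows = [line.rstrip("\n").split("\t") for line in f]
```

For comparison, `pandas.read_csv` infers gzip and bzip2 compression from the file extension by default (`compression='infer'`), which is the convergence mentioned above.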
What I mean by this is that you might have a categorical, which internally in the h5ad is represented by e.g. uns/condition_categories. If there is only one condition (perhaps because it's a subset of data), this will fail the Pandas check for unique categories, because the shape for uns/condition_categories is (1, ).
This can be avoided by detecting singleton categories when writing the file with h5py and appending a dummy category, to ensure there are at least two unique values in uns/condition_categories.
The concatenate function does not take care of layers of an anndata object yet.
I was playing around with some visualization on a large dataset, when I noticed some surprisingly high memory usage. I think I've narrowed it down to unexpected memory growth from taking views:
In [1]: import scanpy.api as sc
In [2]: %load_ext memory_profiler
In [3]: %memit adata = sc.read("bm.h5ad")
peak memory: 5317.54 MiB, increment: 5187.05 MiB
In [4]: %memit
peak memory: 2624.52 MiB, increment: 0.00 MiB
In [5]: %memit view = adata[:, (adata.var["n_cells_by_counts"] > 10000)]
peak memory: 5299.68 MiB, increment: 2675.16 MiB
In [6]: %memit
peak memory: 5080.07 MiB, increment: 0.00 MiB
My assumption here is that taking a view shouldn't cause noticeable growth in memory usage. I'm pretty sure it's not just how memory_profiler is counting objects, since top and Activity Monitor pick this up as well.
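One possible contributor, offered as an assumption rather than a diagnosis: in NumPy, indexing with a boolean mask always produces a new array, whereas basic slicing returns a view on the same buffer. If a view's `X` is built by indexing the parent's `X` with a mask, a copy of the selected data is allocated. The sketch below only demonstrates the NumPy behaviour.

```python
import numpy as np

X = np.zeros((1000, 200), dtype=np.float32)
mask = np.arange(X.shape[1]) % 2 == 0  # keep every other column

sliced = X[:, :100]  # basic slicing: shares the parent's buffer
masked = X[:, mask]  # boolean indexing: allocates a new buffer

slice_shares = np.shares_memory(X, sliced)
mask_shares = np.shares_memory(X, masked)
extra_bytes = masked.nbytes  # memory newly allocated by the mask selection
```

Here the mask keeps 100 of 200 float32 columns, so the selection allocates half the size of the original array.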
We started using syntax only available on Python 3.6, but our setup.py says we support 3.5.
anndata is no longer tested on 3.5 since a dependency (loompy) needs 3.6, but we should simply skip loompy tests on 3.5 instead.
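Skipping the loompy tests could be sketched as follows with the standard library's `unittest`. The test class and its contents are hypothetical; the project's real test suite may use a different framework and different checks.

```python
import importlib.util
import sys
import unittest

# True only if loompy can be imported in this environment.
HAS_LOOMPY = importlib.util.find_spec("loompy") is not None


@unittest.skipIf(
    sys.version_info < (3, 6) or not HAS_LOOMPY,
    "loompy needs Python 3.6+ and must be installed",
)
class LoomTests(unittest.TestCase):
    """Runs only where loompy is usable; is reported as skipped elsewhere."""

    def test_connect_is_available(self):
        import loompy

        self.assertTrue(hasattr(loompy, "connect"))
```

The rest of the suite then still runs on 3.5, and the skip reason shows up in the test report instead of a failure.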
Hi,
I'm trying to read a matrix file as 'float64'. I did the following:
adata = ad.read_text('./test.txt',delimiter='\t',dtype='float64')
I have specified the dtype as 'float64', but adata.X still shows the default 'float32'. Is this a bug, or did I miss something?
I'm attaching a short script here to reproduce this issue. I would really appreciate your help. Many thanks!
Archive.zip
This may be intentional, but there seems to be an issue when the raw data only has one row. We have come across this in our unit tests when creating a small example data set with only one observation.
In [1]: import anndata
In [2]: anndata.__version__
Out[2]: '0.6.10'
In [3]: import pandas as pd
In [4]: d = [
...: (1, 'A', 'a', 'Z', 'z'),
...: (2, 'A', 'b', 'Z', 'z'),
...: (3, 'B', 'c', 'Z', 'z'),
...: (4, 'B', 'd', 'Z', 'z'),
...: ]
In [5]: df = pd.DataFrame(d, columns='c0 c1 c2 c3 c4'.split())
...: df
Out[5]:
c0 c1 c2 c3 c4
0 1 A a Z z
1 2 A b Z z
2 3 B c Z z
3 4 B d Z z
In [6]: df = df.set_index('c1 c2 c3 c4'.split())['c0'].unstack(level=[2, 3]).T
...: df
Out[6]:
c1 A B
c2 a b c d
c3 c4
Z z 1 2 3 4
In [7]: def convert_idx(x): return x.to_frame().reset_index(drop=True)
In [8]: obs = convert_idx(df.index)
In [9]: var = convert_idx(df.columns)
In [10]: a = anndata.AnnData(X=df.values, obs=obs, var=var)
In [11]: a
Out[11]:
AnnData object with n_obs × n_vars = 1 × 4
obs: 'c3', 'c4'
var: 'c1', 'c2'
In [12]: a.obs
Out[12]:
c3 c4
0 Z z
In [13]: a.var
Out[13]:
c1 c2
0 A a
1 A b
2 B c
3 B d
In [14]: assert a.shape == a.X.shape, '{} != {}'.format(a.shape, a.X.shape)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-14-7f38676dce05> in <module>()
----> 1 assert a.shape == a.X.shape, '{} != {}'.format(a.shape, a.X.shape)
AssertionError: (1, 4) != (4,)
In [15]: a[0, 0]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-15-0d768f50cb80> in <module>()
----> 1 a[0, 0]
~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in __getitem__(self, index)
1292 def __getitem__(self, index):
1293 """Returns a sliced view of the object."""
-> 1294 return self._getitem_view(index)
1295
1296 def _getitem_view(self, index):
~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in _getitem_view(self, index)
1296 def _getitem_view(self, index):
1297 oidx, vidx = self._normalize_indices(index)
-> 1298 return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
1299
1300 # this is used in the setter for uns, if a view
~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in __init__(self, X, obs, var, uns, obsm, varm, layers, raw, dtype, shape, filename, filemode, asview, oidx, vidx)
674 if not isinstance(X, AnnData):
675 raise ValueError('`X` has to be an AnnData object.')
--> 676 self._init_as_view(X, oidx, vidx)
677 else:
678 self._init_as_actual(
~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in _init_as_view(self, adata_ref, oidx, vidx)
735 # set data
736 if self.isbacked: self._X = None
--> 737 else: self._init_X_as_view()
738
739 self._layers = AnnDataLayers(self, adata_ref=adata_ref, oidx=oidx, vidx=vidx)
~/anaconda3/envs/dev-procanswon-py/lib/python3.6/site-packages/anndata/base.py in _init_X_as_view(self)
750 self._X = None
751 return
--> 752 X = self._adata_ref.X[self._oidx, self._vidx]
753 if len(X.shape) == 2:
754 n_obs, n_vars = X.shape
IndexError: too many indices for array
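The assertion above shows `a.X` with shape `(4,)` while the object reports `(1, 4)`, so two-dimensional indexing fails. A minimal sketch of restoring the missing axis before indexing, assuming the intended shape is known; `as_2d` is a hypothetical helper, not part of anndata.

```python
import numpy as np


def as_2d(X, n_obs, n_vars):
    """Return X with shape (n_obs, n_vars), restoring a squeezed axis if needed.

    Hypothetical helper for illustration; not part of anndata.
    """
    X = np.asarray(X)
    if X.ndim < 2:
        # reshape raises ValueError if the number of elements does not fit.
        X = X.reshape(n_obs, n_vars)
    if X.shape != (n_obs, n_vars):
        raise ValueError(f"expected shape {(n_obs, n_vars)}, got {X.shape}")
    return X


flat = np.array([1, 2, 3, 4])  # what `a.X` looked like: shape (4,)
X = as_2d(flat, n_obs=1, n_vars=4)
value = X[0, 0]
```

With the axis restored, `X[0, 0]` works and `X.shape` matches the `(1, 4)` reported by the object.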
Xarray has a lot of advantages, e.g.:
The only big problem currently is the missing sparse data support, but this will hopefully change in the near future:
pydata/xarray#1375