Comments (18)
It is a bug! 🙂 Thanks for pointing it out!
from anndata.
Related:
When all column names are integers or strings of integers, anndata.read_h5ad()
fails to read a written dataset:
# dataset = anndata.read_h5ad(file)
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 581, in from_codes
codes = coerce_indexer_dtype(np.asarray(codes), categories)
File "/usr/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 606, in coerce_indexer_dtype
return _ensure_int16(indexer)
File "pandas/_libs/algos_common_helper.pxi", line 3209, in pandas._libs.algos.ensure_int16
File "pandas/_libs/algos_common_helper.pxi", line 3214, in pandas._libs.algos.ensure_int16
ValueError: invalid literal for int() with base 10: 'ENSG00000243485'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-9-4e533aad0e71>", line 1, in <module>
dataset = anndata.read_h5ad(file)
File "/usr/lib/python3.6/site-packages/anndata/readwrite/read.py", line 345, in read_h5ad
return AnnData(d)
File "/usr/lib/python3.6/site-packages/anndata/base.py", line 636, in __init__
filename=filename, filemode=filemode)
File "/usr/lib/python3.6/site-packages/anndata/base.py", line 755, in _init_as_actual
X, obs, var, uns, obsm, varm, raw = self._from_dict(X)
File "/usr/lib/python3.6/site-packages/anndata/base.py", line 1878, in _from_dict
categories=v)
File "/usr/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 584, in from_codes
"codes need to be convertible to an arrays of integers")
ValueError: codes need to be convertible to an arrays of integers
from anndata.
Yes, currently only string column names are allowed. It would take a while to adapt anndata to integer names. Why do you need it?
In any case: there should more meaningful error messages. Sorry for not having thought of this.
from anndata.
I stumbled across this when I read some barcodes.tsv with Pandas and headers=False.
Headers=False causes Pandas to create column names by simply numbering them.
This is usually no problem since the column names are usually set afterwards (e.g. adata.obs_names=...).
However, I tried to join multiple AnnData objects and did not want to name any columns before merging them.
I do not need integer column names, I just thought this would be a bug :)
from anndata.
but we can still leave this open and support them at one point. there’s no non-technical reason for only supporting string names
from anndata.
oh yes good to know... I was just writing an issue on this. This could easily be fixed by just converting integer arguments to strings.
from anndata.
that would be too surprising. if people then try to index with integers and it doesn’t work as expected, they’ll get confused
from anndata.
Because after loading the columns could not be indexed by integers anymore? Damn...
Would it be possible to write an error message until this works?
Edit: editing around, definitely too early for me :)
from anndata.
Certainly.
I think the problem with the whole thing is that I was wrong. there is a non-technical reason why this is not supported:
If we support integers as column names, indexing becomes ambiguous: does adata[0,:]
mean the first row or the row with the column name 0
?
pandas got around it via .loc
and .iloc
where one operates on names and one on indices. What should we do?
from anndata.
So the problem arises with the annotations, which are pandas dataframes.
In general, it seems reasonable not to change the functionality and behaviour of third party packages, so since pandas supports indexing by integers, you may also support it with anndata, with all its caveats.
When you convert a data frame to a record array and back, all numerical string values, will be converted to integers.
>>> df = pd.DataFrame({0: [1, 2, 3], '1': [4, 5, 6]})
>>> pd.DataFrame.from_records(df.to_records())
index 0 1
0 0 1 4
1 1 2 5
2 2 3 6
I think it is more important that the data is written and can be read properly. You can easily warn the user about problems when mixing up int and str indexes, within the documentation.
from anndata.
pandas got around it via .loc and .iloc where one operates on names and one on indices. What should we do?
Yes, I'm well aware of that and I've thought several times about whether we should do this, too. but as of today, I vote against it. I cannot think of a single good practical reason where integer indices are meaningful. The only practical reason that I can think of concerns memory usage. The only convincing - but not practical - reason would be full consistency with the full flexibility of pandas. But I like the current behavior of []
; and the docs explain it well, I think:
Indexing into an AnnData object with a numeric is supposed to be positional, like pandas .iloc method, while indexing with a string/ categorical is supposed to behave like .loc.
In general, it seems reasonable not to change the functionality and behaviour of third party packages,
We don't change functionality of pandas. But you're right, we make more assumptions than pandas: only string indices, only string categories (like R's factors). That's not a good thing and it should usually be solved by generating a subclass that restricts to that functionality. But that is pretty tricky with pandas dataframes and we don't want to get into that. Writing warnings is a good thing. I don't know why I still haven't added these warnings. I'd actually even throw an error. But in a way the solution will always be half-cooked, because I don't want to rewrite the actual index object of pandas via subclassing, but just the setters in AnnData for .obs
, .var
, .var_names
, .obs_names
, which are also called in the constructor. So, people will still be able to call .obs[whatever_you_like]
and put there whatever they like, which is bad... Hm.
Happy to discuss more.
from anndata.
We were mixing up things, here. Integer column names are no technical problem. Integer indices are a technical problem. The latter is addressed through the warnings, here: 51be5d8.
The former should be addressed in a different way.
from anndata.
I had this issue as well–a very simple use case where this comes up is if I try to use lists for gene and cell names:
adata = anndata.AnnData(X, obs=cell_names, var=gene_names)
The resulting object has integer column names for obs
and var
. From a usability perspective, I think the above code (which is simple and intuitive) should Just Work™. The solution is to create an empty pandas DataFrame with the specified index (but not an Index or a Series!), but that was not at all obvious to me...
from anndata.
Hm, I disagree: the above shouldn't work, as the docs clearly state that you need to call obs=pd.DataFrame(index=cell_names)
. Alternatively, you can also create the object and then do adata.obs.index = cell_names
.
It's a bug that we don't throw an error if something other than a dataframe
or dict
is passed.
@flying-sheep: I'll move forward and add the type check, you agree?
from anndata.
Sure, it should either work correctly or raise an error immediately–silently working until you try to write a file is the worst of both worlds.
For what it's worth I disagree that "the docs clearly state" the required usage–the DataFrame(index=cell_names)
syntax is never actually mentioned.
from anndata.
Thanks for pointing that out! :)
I meant the docs "clearly state that one should pass a dataframe or a dict". But right, we should add some more explanation. I'll add that together with the type check. I agree that silently working is the worst that can happen. It has just never been pointed out by anyone that you could pass a simple iterable to obs
and the constructor wouldn't complain. 🙈
from anndata.
Update on this:
Previously, conversion to a rec array converted any columns names of obs to strings. Now we don't use a recarray intermediate, but instead write each column as it's own hdf5 dataset to the h5py file. hdf5 datasets can't have integers for names, so that fails.
from anndata.
#777 is a more complete version of this issue, so closing this to retain only one duplicate
from anndata.
Related Issues (20)
- `to_df` docstrings
- dask.array.store but with anndata schema HOT 1
- Indexing With a Dask Array HOT 5
- adata.raw = adata HOT 1
- GPU CI tests failing HOT 2
- Add working `GroupStorageType` type annotation
- 0.10.6 - Cannot write files due to implicit conversion to str dtype. HOT 2
- Saving .h5ad with pd.Series in .uns results in IORegistryError HOT 3
- `read_elem_as_dask` HOT 8
- `FileNotFoundError: [Errno 2] No such file or directory: 'llvm-config'` on CI HOT 2
- Upcoming Scipy 1.13 incompatible with file-backed sparse matrices
- CZI files' `var` count changd
- `/` in column names makes AnnData Zarr object unreadable on windows HOT 12
- String indexes in var lead to UserWarning HOT 1
- Move tests out of the package
- Refactor `BaseCompressedSparseDataset` and subclasses to remove `backed_{csr,csc}_matrix`
- concat_on_disk should use int64 indptrs by default so that it can concatenate large files HOT 4
- TypeError: Can't implicitly convert non-string objects to strings HOT 4
- Error concatenating scRNA with visium dataset
- Error: Expected a python object, received a character HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from anndata.