Hi, I recently stumbled about the following problem: <div class="highlight highlig

[bug] Support integers as column names about anndata HOT 18 CLOSED

scverse commented on July 18, 2024

[bug] Support integers as column names

from anndata.

Comments (18)

falexwolf commented on July 18, 2024 1

It is a bug! 🙂 Thanks for pointing it out!

from anndata.

Hoeze commented on July 18, 2024

Related:
When all column names are integers or strings of integers, anndata.read_h5ad() fails to read a written dataset:

# dataset = anndata.read_h5ad(file)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 581, in from_codes
    codes = coerce_indexer_dtype(np.asarray(codes), categories)
  File "/usr/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 606, in coerce_indexer_dtype
    return _ensure_int16(indexer)
  File "pandas/_libs/algos_common_helper.pxi", line 3209, in pandas._libs.algos.ensure_int16
  File "pandas/_libs/algos_common_helper.pxi", line 3214, in pandas._libs.algos.ensure_int16
ValueError: invalid literal for int() with base 10: 'ENSG00000243485'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-4e533aad0e71>", line 1, in <module>
    dataset = anndata.read_h5ad(file)
  File "/usr/lib/python3.6/site-packages/anndata/readwrite/read.py", line 345, in read_h5ad
    return AnnData(d)
  File "/usr/lib/python3.6/site-packages/anndata/base.py", line 636, in __init__
    filename=filename, filemode=filemode)
  File "/usr/lib/python3.6/site-packages/anndata/base.py", line 755, in _init_as_actual
    X, obs, var, uns, obsm, varm, raw = self._from_dict(X)
  File "/usr/lib/python3.6/site-packages/anndata/base.py", line 1878, in _from_dict
    categories=v)
  File "/usr/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 584, in from_codes
    "codes need to be convertible to an arrays of integers")
ValueError: codes need to be convertible to an arrays of integers

from anndata.

falexwolf commented on July 18, 2024

Yes, currently only string column names are allowed. It would take a while to adapt anndata to integer names. Why do you need it?

In any case: there should more meaningful error messages. Sorry for not having thought of this.

from anndata.

Hoeze commented on July 18, 2024

I stumbled across this when I read some barcodes.tsv with Pandas and headers=False.
Headers=False causes Pandas to create column names by simply numbering them.

This is usually no problem since the column names are usually set afterwards (e.g. adata.obs_names=...).
However, I tried to join multiple AnnData objects and did not want to name any columns before merging them.

I do not need integer column names, I just thought this would be a bug :)

from anndata.

flying-sheep commented on July 18, 2024

but we can still leave this open and support them at one point. there’s no non-technical reason for only supporting string names

from anndata.

sirno commented on July 18, 2024

oh yes good to know... I was just writing an issue on this. This could easily be fixed by just converting integer arguments to strings.

from anndata.

flying-sheep commented on July 18, 2024

that would be too surprising. if people then try to index with integers and it doesn’t work as expected, they’ll get confused

from anndata.

sirno commented on July 18, 2024

Because after loading the columns could not be indexed by integers anymore? Damn...

Would it be possible to write an error message until this works?

Edit: editing around, definitely too early for me :)

from anndata.

flying-sheep commented on July 18, 2024

Certainly.

I think the problem with the whole thing is that I was wrong. there is a non-technical reason why this is not supported:

If we support integers as column names, indexing becomes ambiguous: does adata[0,:] mean the first row or the row with the column name 0?

pandas got around it via .loc and .iloc where one operates on names and one on indices. What should we do?

from anndata.

sirno commented on July 18, 2024

So the problem arises with the annotations, which are pandas dataframes.

In general, it seems reasonable not to change the functionality and behaviour of third party packages, so since pandas supports indexing by integers, you may also support it with anndata, with all its caveats.

When you convert a data frame to a record array and back, all numerical string values, will be converted to integers.

>>> df = pd.DataFrame({0: [1, 2, 3], '1': [4, 5, 6]})  
>>> pd.DataFrame.from_records(df.to_records())                                                   
   index  0  1
0      0  1  4
1      1  2  5
2      2  3  6

I think it is more important that the data is written and can be read properly. You can easily warn the user about problems when mixing up int and str indexes, within the documentation.

from anndata.

falexwolf commented on July 18, 2024

pandas got around it via .loc and .iloc where one operates on names and one on indices. What should we do?

Yes, I'm well aware of that and I've thought several times about whether we should do this, too. but as of today, I vote against it. I cannot think of a single good practical reason where integer indices are meaningful. The only practical reason that I can think of concerns memory usage. The only convincing - but not practical - reason would be full consistency with the full flexibility of pandas. But I like the current behavior of []; and the docs explain it well, I think:

Indexing into an AnnData object with a numeric is supposed to be positional, like pandas .iloc method, while indexing with a string/ categorical is supposed to behave like .loc.

In general, it seems reasonable not to change the functionality and behaviour of third party packages,

We don't change functionality of pandas. But you're right, we make more assumptions than pandas: only string indices, only string categories (like R's factors). That's not a good thing and it should usually be solved by generating a subclass that restricts to that functionality. But that is pretty tricky with pandas dataframes and we don't want to get into that. Writing warnings is a good thing. I don't know why I still haven't added these warnings. I'd actually even throw an error. But in a way the solution will always be half-cooked, because I don't want to rewrite the actual index object of pandas via subclassing, but just the setters in AnnData for .obs, .var, .var_names, .obs_names, which are also called in the constructor. So, people will still be able to call .obs[whatever_you_like] and put there whatever they like, which is bad... Hm.

Happy to discuss more.

from anndata.

falexwolf commented on July 18, 2024

We were mixing up things, here. Integer column names are no technical problem. Integer indices are a technical problem. The latter is addressed through the warnings, here: 51be5d8.

The former should be addressed in a different way.

from anndata.

jamestwebber commented on July 18, 2024

I had this issue as well–a very simple use case where this comes up is if I try to use lists for gene and cell names:

adata = anndata.AnnData(X, obs=cell_names, var=gene_names)

The resulting object has integer column names for obs and var. From a usability perspective, I think the above code (which is simple and intuitive) should Just Work™. The solution is to create an empty pandas DataFrame with the specified index (but not an Index or a Series!), but that was not at all obvious to me...

from anndata.

falexwolf commented on July 18, 2024

Hm, I disagree: the above shouldn't work, as the docs clearly state that you need to call obs=pd.DataFrame(index=cell_names). Alternatively, you can also create the object and then do adata.obs.index = cell_names.

It's a bug that we don't throw an error if something other than a dataframe or dict is passed.

@flying-sheep: I'll move forward and add the type check, you agree?

from anndata.

jamestwebber commented on July 18, 2024

Sure, it should either work correctly or raise an error immediately–silently working until you try to write a file is the worst of both worlds.

For what it's worth I disagree that "the docs clearly state" the required usage–the DataFrame(index=cell_names) syntax is never actually mentioned.

from anndata.

falexwolf commented on July 18, 2024

Thanks for pointing that out! :)

I meant the docs "clearly state that one should pass a dataframe or a dict". But right, we should add some more explanation. I'll add that together with the type check. I agree that silently working is the worst that can happen. It has just never been pointed out by anyone that you could pass a simple iterable to obs and the constructor wouldn't complain. 🙈

from anndata.

ivirshup commented on July 18, 2024

Update on this:

Previously, conversion to a rec array converted any columns names of obs to strings. Now we don't use a recarray intermediate, but instead write each column as it's own hdf5 dataset to the h5py file. hdf5 datasets can't have integers for names, so that fails.

from anndata.

flying-sheep commented on July 18, 2024

#777 is a more complete version of this issue, so closing this to retain only one duplicate

from anndata.

[bug] Support integers as column names about anndata HOT 18 CLOSED

Comments (18)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent