Giter VIP home page Giter VIP logo

Comments (13)

falexwolf avatar falexwolf commented on July 18, 2024 3

OK, so now we at least have ._X always 2d. #55 I'm still not fully convinced by your super-verbose solution. ;) Also not by my own.

One other solution: go away completely from .X and transition to .values, which returns a container for arrays, sparse matrices and backing files.

Calling adata[:, 'gene1'].values will always return something 2d. But calling adata.values[:, 'gene1'] will automatically be flattened, as in numpy, pandas, tensorflow...

.X will remain for backwards compat.

What do you think, @flying-sheep (also @bretttully if you want to comment).

from anndata.

falexwolf avatar falexwolf commented on July 18, 2024

This is intentional and the same behavior as of numpy arrays, see

adata = AnnData(np.ones((2, 2)))
adata[:, 0].X == adata.X[:, 0]

as mentioned in the docs: https://anndata.readthedocs.io/en/latest/anndata.AnnData.html

Just recently we had a discussion of whether this helpful or not. I personally think it's helpful. What do you think.

from anndata.

bretttully avatar bretttully commented on July 18, 2024

I can see why in your example this behaviour is good, but in the two errors in the original post it is very confusing.

Without knowing more, my intuition was:

  • the shape of the AnnData and the shape of the underlying X should be the same
  • if the size of .obs is >=1 and size of .var is >=1 I would expect the be able to slice the AnnData object with a 2D index.

Generally, I've come across scenarios in numpy and pandas where applying the same function on an array will return different 'shape' results based on the input. With pandas, there are times when it remains as a DataFrame or can switch to a Series. These are pretty annoying to deal with as a library author, but it probably indicates that these edge cases are hard... With numpy, the reshape function with a -1 is normally an easy way out, so maybe you could surface this with AnnData?

from anndata.

falexwolf avatar falexwolf commented on July 18, 2024

Yes, very happy to add a AnnData.reshape function!

Also happy to change the output of AnnData.shape to (,1) if .X gets flattened.

That's an interesting point that you make about pandas. I know that

df = pd.DataFrame(np.ones(3, 2))
df.loc[0, :]
df.loc[:, 0]

returns a Series - hence a 1d object - in both of the last lines; so this is consistent with slicing a numpy array and an AnnData. As for numpy arrays, consider it highly convenient that a DataFrame behaves this way. Slicing out a single vector is one of the most common operations that you apply to a DataFrame and I'd consider it highly bothersome if one had to .flatten() a (1, n)-shaped object every time.

Why is it bothersome to you the other way round?

from anndata.

bretttully avatar bretttully commented on July 18, 2024

Here is the anti-pattern we are trying to avoid for our users:

anndata_obj = a_function_filtering_our_db_based_on_user_params(**kwargs)
if anndata_obj.X.ndim == 1:
    first_obs_data = anndata_obj[:]
else:
    first_obs_data = anndata_obj[0, :]

If we can deal with the if statement inside our a_function_filtering_our_db_based_on_user_params() function so that we promise our users they can always index with combination of obs and var with an AnnData object returned by our API.

Thinking about the reshape function, it really doesn't need the power of numpy's version (in fact, that might be actively bad, as the shape of X may get out of sync with n_obs and n_var). Really, the method is enforce_2d(flag: bool) -- perhaps this sets a flag in the metadata of the object so that it is persisted through the hdf backend?

from anndata.

falexwolf avatar falexwolf commented on July 18, 2024

Makes sense! If it's only about retaining indexing with var and obs indices, this can easily be enabled even if AnnData's with 1d X are allowed (also right now, the AnnData knows whether the 1d vector is an obs or a var).

But I see that an .enforce_2d(flag: bool) could be useful in several instances. I'm sure you'd like to avoid your user having to call this explicitly. So, do you want, additionally to a method .enforce_2d() an enforce_2d parameter in the constructor?

I'd tend to call it force_2d if that's not bad English (we have a force_sparse parameter somewhere else, which I could alternatively rename to enforce_sparse).

from anndata.

bretttully avatar bretttully commented on July 18, 2024

But I see that an .enforce_2d(flag: bool) could be useful in several instances. I'm sure you'd like to avoid your user having to call this explicitly. So, do you want, additionally to a method .enforce_2d() an enforce_2d parameter in the constructor?

This would be useful in the constructor, yes -- would you try and persist that when creating a view/copy of the object? I'm ambivalent as to how that should work.

I'd tend to call it force_2d

No worries, not bad English at all

from anndata.

falexwolf avatar falexwolf commented on July 18, 2024

This would be useful in the constructor, yes -- would you try and persist that when creating a view/copy of the object? I'm ambivalent as to how that should work.

Of course, this would persist throughout copies and views. Wouldn't be a problem.

Just yesterday, another question came up regarding efficient extraction of a slice of .X from an AnnData (scverse/scanpy#244 (comment)). As it's likely needed that another accessor is written for this, and as this could serve as a convenience accessor that provides (n x 1) matrices as vectors to users, I can now think of an AnnData whose data matrix is always 2d. (I previously admitted that "another access operation" that would provide 1d vectors would have been the better design choice and @flying-sheep always wanted to keep things 2d: #55)

  • We need to think about how to make this an easy transition for users, many of them have the call adata[:, 'gene_name'].X in their code and expect it to return a vector. While enforcing ._X to be always 2d could be done in a backwards compatible way by calling .flatten in the property .X if ._X is (n x 1), this wouldn't make you happy, right? Your test above would still break.
  • We need an elegant name for the convenience access: I guess I'd call it .slice as .sliceX is ugly. Calling .slice wouldn't slice the whole AnnData but only return a slice of the data matrix potentially returning something flattened.

@flying-sheep: You don't need to go through the whole issue, just consider the two bullets above. What do you think? I guess you would be happier with this

from anndata.

falexwolf avatar falexwolf commented on July 18, 2024

Another option: Instead of letting.X directly point to ._X, we make it a convenience accessor that enables slicing by strings and get's flattened if shaped (n x 1). At the same time, we provide the underlying data ._X (np.ndarray, sp.spmatrix, h5py.Dataset), which cannot be sliced by strings and is always 2d, with a different name (e.g. .X2d).

The big advantage is that this is backwards compatible. But I fear it's super confusing to have access to the data matrix twice in the object...

from anndata.

bretttully avatar bretttully commented on July 18, 2024

Not withstanding backwards compatibility, I would take a different approach with something akin to the following:

def extract_vars(self, vars: Union[str, List[str]], as_adata: bool = False):
    # most efficient method here...
    result = self[:, vars]
    if not as_adata:
        result = result._X
        if not isinstance(vars, list):
            result = result.flatten()
    return result

That way, you have better encapsulation and if you need to change implementation details like the shape of _X you can do it without impacting users. It also seemed like that other thread was not sure about most efficient way to do things, so perhaps this makes it more explicit?

from anndata.

bretttully avatar bretttully commented on July 18, 2024

Worth mentioning that this definitely makes users' code more verbose and is not as simple as the pandas df[vars] -- however, having watched people new to pandas get up to speed, the amount of confusion between df[var_name], df[[var_name]], df[vars] and df.loc[:, vars], and when to use which -- and then patterns emerge like df.loc[:, [var_name]].copy() to always get a DataFrame back that you can go on an add more columns. There is a part of me that thinks more verbosity and explicitness can be beneficial.

I'm on the fence, and you guys have better knowledge about how AnnData is currently being used.

from anndata.

github-actions avatar github-actions commented on July 18, 2024

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

from anndata.

flying-sheep avatar flying-sheep commented on July 18, 2024

Fixed by #171

from anndata.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.