Comments (8)
Hi @davisidarta
Are you still interested in this?
from scarf.
Hi! Thanks!
The reason that KNN graph computation is hard-coded to PCA (and LSI for scATAC-Seq data) is simply that an out-of-core implementation of PCA is available in the Python ecosystem (sklearn in this case). One of the mottos of Scarf is to be memory efficient, and that's why we do not support methods that would violate that motto.
But as you suggest, providing an externally computed KNN graph and dimension reductions to Scarf can be a viable alternative.
Importing reduced dimensions for the purpose of graph calculation is a bit tricky at this point. It is, however, quite possible. Suggestions are welcome.
I'm working on a feature to import a KNN graph directly from an H5ad (anndata) file. Will that work for you at this stage? If yes, then I can prioritize it. Please let me know, as any early suggestions might be helpful.
from scarf.
Thanks for the swift reply :)
The reason that KNN graph computation is hard-coded to PCA (and LSI for scATAC-Seq data) is simply that an out-of-core implementation of PCA is available in the Python ecosystem (sklearn in this case).
I understand. That's reasonable.
Importing reduced dimensions for the purpose of graph calculation is a bit tricky at this point. It is, however, quite possible. Suggestions are welcome.
Maybe playing around with transform_ann could do it? I'm not used to working with out-of-core computing, but it seems like the kNN graph step comes right after obtaining the PCA/LSI embeddings. Is that right? If so, maybe add an option so that, if the reduction method is neither PCA nor LSI, it must be a user-provided np.ndarray on which the kNN graph is computed?
I'm working on a feature to import a KNN graph directly from an H5ad (anndata) file. Will that work for you at this stage?
For my specific case, I'd like to add both dimensionality reductions and kNN graphs, but being able to build a new kNN graph out-of-core on top of an externally provided dimensionality reduction would be fantastic enough. Thank you for being so helpful.
from scarf.
Yes, the changes will need to be made in the AnnStream class and also in three methods of GraphDataStore: _choose_reduction_method, _set_graph_params and make_graph.
Overriding the default reduction methods with user provided matrix should be entirely possible.
My only concern here is that this will break the logic of MappingDataStore which expects to find PCA loadings.
Let me work out a good strategy here. Do you have any more suggestions at this point?
from scarf.
Hi Davi,
I have something in the works now that may solve the issue.
Since the philosophy with Scarf is to ensure memory efficiency, it will not be ideal to have even the dimension reduced data as input.
The solution here is to upload a transformer, for example, PCA loadings of the form (d x p) where d is the number of features and p is the number of reduced dimensions. We can use such a custom, user-provided matrix and calculate its dot product with the normalized data. This is done in an iterative fashion so that whole data need not be loaded into memory. Additionally, the user-provided matrix will be saved into the Zarr hierarchy so that it can later be reused for transforming external samples as well (for example when performing projection).
Obviously, the caveat here is that this only works for linear dimension reductions.
What do you think?
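The chunked dot product described above can be sketched in plain NumPy. This is only an illustration of the idea, not Scarf's actual implementation: the function name project_in_chunks, the chunk_size parameter and the random toy matrices are all assumptions made for the example.

```python
import numpy as np


def project_in_chunks(data, loadings, chunk_size=1000):
    """Multiply `data` (cells x d) by `loadings` (d x p) one row block at a time.

    Hypothetical helper: `data` can be any array-like that supports row
    slicing, so only `chunk_size` rows need to be in memory at once.
    """
    n_cells, n_feats = data.shape
    if loadings.shape[0] != n_feats:
        raise ValueError("loadings must have one row per feature (d x p)")
    reduced = np.empty((n_cells, loadings.shape[1]), dtype=np.float64)
    for start in range(0, n_cells, chunk_size):
        stop = min(start + chunk_size, n_cells)
        # Only this block of rows is materialised before the dot product.
        reduced[start:stop] = np.asarray(data[start:stop]) @ loadings
    return reduced


# Toy stand-ins for normalized data (cells x d) and user-provided loadings (d x p).
rng = np.random.default_rng(0)
X = rng.random((2500, 40))
W = rng.random((40, 5))
reduced = project_in_chunks(X, W, chunk_size=600)
```

The same loadings matrix can be applied to any other sample with the same d features, which is why storing it allows later projection. A non-linear embedding has no such matrix, hence the caveat.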
from scarf.
Hi,
Version 0.7.6 now contains the ability to provide external transformers. Here is an example:
import scarf
from sklearn.decomposition import PCA

# Download an example dataset and open it as a DataStore
scarf.fetch_dataset('tenx_5K_pbmc_rnaseq', save_path='scarf_datasets', as_zarr=True)
ds = scarf.DataStore('scarf_datasets/tenx_5K_pbmc_rnaseq/data.zarr')

# Indices of the features flagged as highly variable genes
hvg_idx = ds.RNA.feats.get_index_by(
    ds.RNA.feats.fetch('ids', key='I__hvgs'), 'ids')
# Normalized, log-transformed data restricted to those features (loaded into memory)
df = ds.RNA.normed(feat_idx=hvg_idx, log_transform=True,
                   renormalize_subset=True).compute()

# Fit PCA externally and pass its loadings, transposed to (d x p), to Scarf
pca = PCA(n_components=5)
pca.fit(df)
ds.make_graph(feat_key='hvgs', k=11, custom_loadings=pca.components_.T)
from scarf.
Hi! I'm really sorry for my late reply.
I have something in the works now that may solve the issue.
Since the philosophy with Scarf is to ensure memory efficiency, it will not be ideal to have even the dimension reduced data as input.
Would it be possible, though, at the user's discretion? Or would it be incompatible with the object architecture entirely?
The solution here is to upload a transformer, for example, PCA loadings of the form (d x p) where d is the number of features and p is the number of reduced dimensions. We can use such a custom, user-provided matrix and calculate its dot product with the normalized data. This is done in an iterative fashion so that whole data need not be loaded into memory. Additionally, the user-provided matrix will be saved into the Zarr hierarchy so that it can later be reused for transforming external samples as well (for example when performing projection).
I think this is a wonderful idea that could couple with methods such as the new liger integrated method, MOFA, NMF, and perhaps ICA. I really appreciate your swiftness in making this available - I think this would be really great with MOFA and NMF, and I'll try it out.
However, as you've noted,
Obviously, the caveat here is that this only works for linear dimension reductions.
The thing is, I've developed a new family of non-linear topological dimensionality reduction approaches, which are implemented in TopOMetry. I'm keen to use Scarf instead of Scanpy or Pegasus in my team's workflow, but our main advantage at this point is the really high-resolution maps enabled by these non-linear mappings in our biological systems of interest. Although PCA preserves global distances well, the kNN graph built on it is not guaranteed to preserve topology.
Do you think it would be possible to add a user option of providing a pre-computed dimensionality reduction?
from scarf.
Hi @davisidarta,
Glad to have your comments.
It will be really nice to see if you find the new external transformer feature useful.
I'm taking a deeper look into TopOMetry to investigate the possibilities for the two packages to play nice with each other. There are a lot of interesting concepts in TopOMetry.
Let me summarize how I understand the default steps of dimension reduction in TopOMetry:
- It starts with a 'hvg_matrix' from Scanpy (normalized data containing a feature subset)
- A KNN graph is trained directly on this matrix using HNSWlib, NMSlib or Sklearn.
- A series of operations are performed on this graph to obtain a symmetrical continuous form
- A diffusion matrix of the graph is calculated
- Eigendecomposition (I don't think I fully understand this, but that is on me) of the diffusion matrix is performed
- Multiscale components (lower dimension representation of the data) are obtained from the diffusion basis.
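To make the steps listed above concrete, here is a small dense NumPy sketch of a generic diffusion-map style pipeline (kNN graph, symmetrization, diffusion operator, eigendecomposition, eigenvalue-weighted components). It is not TopOMetry's code: the Gaussian kernel, the median bandwidth, the simple eigenvalue weighting and the function name diffusion_components are all assumptions chosen for illustration, and the brute-force distance matrix only suits toy-sized data.

```python
import numpy as np


def diffusion_components(X, k=10, n_components=5):
    """Generic diffusion-map sketch on a kNN graph (illustrative only)."""
    n = X.shape[0]
    # Brute-force pairwise Euclidean distances (dense, toy-sized data only).
    sq = np.sum(X ** 2, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour

    # kNN graph: keep the k nearest neighbours of every cell.
    idx = np.argsort(dist, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = idx.ravel()
    knn_dist = dist[rows, cols]

    # Continuous edge weights (assumed Gaussian kernel), then symmetrize.
    sigma = np.median(knn_dist)
    aff = np.zeros((n, n))
    aff[rows, cols] = np.exp(-((knn_dist / sigma) ** 2))
    aff = np.maximum(aff, aff.T)

    # D^-1/2 A D^-1/2 is symmetric and shares its eigenvalues with the
    # random-walk diffusion matrix D^-1 A.
    d_inv_sqrt = 1.0 / np.sqrt(aff.sum(axis=1))
    sym = aff * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Eigendecomposition; eigh returns ascending order, so reverse it.
    vals, vecs = np.linalg.eigh(sym)
    vals, vecs = vals[::-1], vecs[:, ::-1]

    # Map back to eigenvectors of D^-1 A, drop the trivial first one and
    # weight the rest by their eigenvalues (a simplified weighting).
    psi = vecs * d_inv_sqrt[:, None]
    return psi[:, 1:n_components + 1] * vals[1:n_components + 1]


rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 8))
components = diffusion_components(toy, k=10, n_components=3)
```

Note that after the graph is built, the original feature matrix is no longer used: everything downstream operates on the graph alone.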
So here are my thoughts if you would like to use Scarf as a backend for TopOMetry.
- I think allowing make_graph to take a reduced representation of the data from TopOMetry is not an optimal solution at this point, but that's not the end of the road.
- You can use Scarf's pipeline to obtain a KNN graph of the data (feature subselected). At this step you can turn off the dimension reduction step by setting dims=0. This is something you already do, though you have a much larger selection of graph construction methods through NMSlib. The advantage of using Scarf is that you will be able to get the first graph very memory efficiently.
- You can obtain the KNN indices and distances of this graph from Scarf (I can elaborate further on this). Alternatively, you can directly use the load_graph method to get a normalized graph, as is done in the UMAP algorithm.
- From what I understand, you now derive the reduced dimensions directly from such a graph and do not need the 'hvg matrix' anymore (which is why tweaking make_graph will not be useful). Please let me know if I'm wrong in this assumption. I created an issue on the TopOMetry repo to discuss this point further.
- Once you have used Scarf's graph and performed dimension reduction on it, you can add this reduced-dimension data as a new Assay in the DataStore. You can then create a graph again for this assay (turning off dimension reduction again) and then use all the downstream features of Scarf.
The benefit that I see in this approach is that it can make TopOMetry highly scalable to larger datasets.
Keen to have your comments and thoughts on such an approach.
Best,
Parashar
from scarf.