davisidarta / topometry
Systematically learn and evaluate manifolds from high-dimensional data
Home Page: https://topometry.readthedocs.io/en/latest/
License: MIT License
I think some people (myself included) will come across this method being familiar with UMAP and classical dimensionality reduction techniques, but it might not be clear when to use UMAP vs. TopOMetry. Could you comment on this?
Might consider adding to the docs and/or README, but feel free to ignore the suggestion as well.
I am not sure why I get this error; thanks in advance.
import topo as tp

tg = tp.TopOGraph(base_knn=20,
                  n_eigs=100,  # set this to the largest estimate!
                  n_jobs=-1,
                  verbosity=0)

# Run a TopOMetry model for PBMC3K
adata = tp.sc.topological_workflow(
    adata,                    # the AnnData object
    tg,                       # the TopOGraph object
    kernels=['bw_adaptive'],  # the kernel(s) to use
    eigenmap_methods=['DM'],  # the eigenmap method(s) to use
    projections=['MAP'],      # the projection(s) to use
    resolution=0.8            # the Leiden clustering resolution
)
----> 7 adata = tp.sc.topological_workflow(
8 adata, # the anndata object
9 tg, # the TopOGraph object
10 kernels=['bw_adaptive'],# the kernel(s) to use
11 eigenmap_methods=['DM'],# the eigenmap method(s) to use
12 projections=['MAP'], # the projection(s) to use
13 resolution=0.8 # the Leiden clustering resolution
14 )
AttributeError: module 'topo' has no attribute 'sc'
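A likely cause (an assumption, not confirmed by the traceback alone) is that topo's single-cell submodule only works when its optional dependencies, such as scanpy, are installed. A quick way to check the environment:

```python
import importlib.util

# Check which optional single-cell dependencies are importable
# (assumption: `topo.sc` relies on scanpy/anndata being installed).
for dep in ("scanpy", "anndata"):
    found = importlib.util.find_spec(dep) is not None
    print(f"{dep}: {'installed' if found else 'MISSING'}")

# If a dependency is missing: pip install scanpy
# Importing the submodule explicitly can also surface the real error:
# from topo import sc  # raises ImportError with the underlying cause
```

If both report as installed, importing `topo.sc` directly should reveal the actual import failure.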
Hi, I have recently started to play with single-cell data analysis, and your bioRxiv paper and approach sound really interesting. I understand why PCA-based dimensionality reduction probably makes the wrong assumptions about the topology of the underlying data.
In that sense, I was curious how you handle integration of multiple datasets. Traditionally, this integration is based on common variable PCs (eigenvectors), but those are selected assuming the same underlying topology. In your package, I can see that if we have biological replicates, we could select the same topology for two libraries and then select the most variable eigenvectors for integration. But what happens if the biological replicates use different library preps, which introduce some kind of batch effect? That batch effect would influence the selection of the best-fitting topology for the data and could make it difficult, if not impossible, to integrate datasets that should share an underlying biology and composition.
My question is: how can we handle these scenarios with TopOMetry? How should we best select eigenvectors for integrating multiple datasets, and how do we prevent sampling methods from introducing batch artifacts into the model?
These may be naive questions, but I would like to understand your take on these.
Best regards,
Sorry for bothering again, but after testing on my entire dataset (around 3/4 million cells) rather than just one sample, I observe the MAP projection below and am not sure what the issue could be. Running the same exact workflow on just one library (11k cells), or on a small subset of the same AnnData object, the projection looks normal.
Here are the code and the projection:
adata = sc.read_h5ad("/pasteur/zeus/projets/p02/LabExMI/singleCell/V3/scRNA_NS_IAV_COV/results/merged_object/adata_scvi.h5ad")
adata.X = adata.X.toarray()
sc.pp.scale(adata, max_value=10)
adata = adata[:, adata.var.highly_variable]

tg = tp.TopOGraph(n_eigs=150, n_jobs=-1, verbosity=0)
tg.run_models(adata.X, kernels=['bw_adaptive'],
              eigenmap_methods=['DM'],
              projections=['MAP'])
tg
TopOGraph object with 727581 samples and 12855 observations and:
. Base Kernels:
bw_adaptive - .BaseKernelDict['bw_adaptive']
. Eigenbases:
DM with bw_adaptive - .EigenbasisDict['DM with bw_adaptive']
. Graph Kernels:
bw_adaptive from DM with bw_adaptive - .GraphKernelDict['bw_adaptive from DM with bw_adaptive']
. Projections:
MAP of bw_adaptive from DM with bw_adaptive - .ProjectionDict['MAP of bw_adaptive from DM with bw_adaptive']
Active base kernel - .base_kernel
Active eigenbasis - .eigenbasis
Active graph kernel - .graph_kernel
adata.obsm['X_topoMAP'] = tg.ProjectionDict['MAP of bw_adaptive from DM with bw_adaptive']
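Since the same workflow looks fine at 11k cells but not at 727k, one way to narrow the problem down (a generic debugging sketch, not a TopOMetry API) is to rerun on random subsets of increasing size and watch at what scale the projection starts to degrade:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 50))  # stand-in for adata.X

# Rerun the same workflow on growing random subsets to find the scale
# at which the MAP projection starts to look wrong.
for n in (500, 1000, 2000, 5000):
    idx = rng.choice(X.shape[0], size=n, replace=False)
    X_sub = X[idx]
    print(n, X_sub.shape)
    # tg = tp.TopOGraph(n_eigs=150, n_jobs=-1, verbosity=0)  # same settings
    # tg.run_models(X_sub, kernels=['bw_adaptive'],
    #               eigenmap_methods=['DM'], projections=['MAP'])
```

If the distortion appears only above some subset size, that points at scale-dependent behavior (e.g. approximate nearest-neighbor search or memory pressure) rather than the workflow itself.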
I'm getting an error when trying to run this bit of code in Colab:
import topo as tp
tg = tp.TopOGraph()
tg.run_layouts(emb, n_components=2, bases=['diffusion', 'fuzzy'], graphs=['diff', 'fuzzy'], layouts=['MAP', 'PaCMAP'])
The error is:
TypeError: PaCMAP.__init__() got an unexpected keyword argument 'n_dims'
I can run PaCMAP outside of topo just fine, so this looks like an easy fix.
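This looks like an API change in pacmap rather than anything Colab-specific (an assumption: newer pacmap releases renamed `n_dims` to `n_components`, which matches the TypeError). A quick way to check which keywords the installed version accepts:

```python
import inspect

def accepted_kwargs(cls):
    """List the keyword parameters a class's __init__ accepts."""
    return [p for p in inspect.signature(cls.__init__).parameters
            if p not in ("self", "args", "kwargs")]

# With pacmap installed you would run (assumption: pacmap is importable):
#   import pacmap
#   print(accepted_kwargs(pacmap.PaCMAP))
# and look for `n_dims` vs `n_components`.

# Self-contained demo of the helper:
class Demo:
    def __init__(self, n_components=2, verbose=False):
        pass

print(accepted_kwargs(Demo))  # ['n_components', 'verbose']
```

Pinning a pacmap version that topo was tested against is a reasonable stopgap until the keyword is updated.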
Hey Davi,
Thanks for the great tool! I just started diving into my data using the evaluation functions. I wanted to point out that the trustworthiness function in eval/local_scores.py is missing and can't be loaded. Was it removed in a previous version? It is also not included in the file here on GitHub.
Best,
Daniel
Using the same code with the MAP projection (projections=["MAP"]) works fine, but with PaCMAP it fails:

tg = tp.TopOGraph(n_eigs=119, n_jobs=-1, verbosity=0)
tg.run_models(adata.X, kernels=['bw_adaptive'],
              eigenmap_methods=['msDM'],
              projections=["PaCMAP"])
ValueError Traceback (most recent call last)
Cell In[65], line 3
1 tg = tp.TopOGraph(n_eigs=119, n_jobs=-1, verbosity=0)
----> 3 tg.run_models(adata.X, kernels=['bw_adaptive'],
4 eigenmap_methods=['msDM'],
5 projections=["PaCMAP"])
File ~/venvs/topometry/lib/python3.11/site-packages/topo/topograph.py:1013, in TopOGraph.run_models(self, X, kernels, eigenmap_methods, projections)
1011 gc.collect()
1012 for projection in projections:
-> 1013 self.project(projection_method=projection)
1014 gc.collect()
File ~/venvs/topometry/lib/python3.11/site-packages/topo/topograph.py:925, in TopOGraph.project(self, n_components, init, projection_method, landmarks, landmark_method, n_neighbors, num_iters, **kwargs)
923 projection_key = projection_method + ' of ' + key
924 start = time.time()
--> 925 Y = Projector(n_components=n_components,
926 projection_method=projection_method,
927 metric=metric,
928 n_neighbors=self.graph_knn,
929 n_jobs=self.n_jobs,
930 landmarks=landmarks,
931 landmark_method=landmark_method,
932 num_iters=num_iters,
933 init=init_Y,
934 nbrs_backend=self.backend,
935 keep_estimator=False,
936 random_state=self.random_state,
937 verbose=self.layout_verbose).fit_transform(input, **kwargs)
938 end = time.time()
939 gc.collect()
File ~/venvs/topometry/lib/python3.11/site-packages/sklearn/utils/_set_output.py:157, in _wrap_method_output..wrapped(self, X, *args, **kwargs)
155 @wraps(f)
156 def wrapped(self, X, *args, **kwargs):
--> 157 data_to_wrap = f(self, X, *args, **kwargs)
158 if isinstance(data_to_wrap, tuple):
159 # only wrap the first output for cross decomposition
160 return_tuple = (
161 _wrap_data_with_container(method, data_to_wrap[0], X, self),
162 *data_to_wrap[1:],
163 )
File ~/venvs/topometry/lib/python3.11/site-packages/topo/layouts/projector.py:416, in Projector.fit_transform(self, X, **kwargs)
405 def fit_transform(self, X, **kwargs):
406 """
407 Calls the fit_transform method of the desired method.
408 If the desired method does not have a fit_transform method, calls the results from the fit method.
(...)
413 Projection results
414 """
--> 416 self.fit(X, **kwargs)
417 return self.Y_
File ~/venvs/topometry/lib/python3.11/site-packages/topo/layouts/projector.py:318, in Projector.fit(self, X, **kwargs)
315 metric = self.metric
316 self.estimator_ = pacmap.PaCMAP(n_components=self.n_components, n_neighbors=self.n_neighbors,
317 apply_pca=False, distance=metric, num_iters=self.num_iters, verbose=self.verbose, **kwargs)
--> 318 self.Y_ = self.estimator_.fit_transform(X=X, init=self.init_Y_)
320 elif self.projection_method == 'TriMAP':
321 try:
File ~/venvs/topometry/lib/python3.11/site-packages/pacmap/pacmap.py:943, in PaCMAP.fit_transform(self, X, init, save_pairs)
925 def fit_transform(self, X, init=None, save_pairs=True):
926 '''Projects a high dimensional dataset into a low-dimensional embedding and return the embedding.
927
928 Parameters
(...)
940 Whether to save the pairs that are sampled from the dataset. Useful for reproducing results.
941 '''
--> 943 self.fit(X, init, save_pairs)
944 if self.intermediate:
945 return self.intermediate_states
File ~/venvs/topometry/lib/python3.11/site-packages/pacmap/pacmap.py:906, in PaCMAP.fit(self, X, init, save_pairs)
904 self.num_dimensions = X.shape[1]
905 # Initialize and Optimize the embedding
--> 906 self.embedding_, self.intermediate_states, self.pair_neighbors, self.pair_MN, self.pair_FP = pacmap(
907 X,
908 self.n_components,
909 self.pair_neighbors,
910 self.pair_MN,
911 self.pair_FP,
912 self.lr,
913 self.num_iters,
914 init,
915 self.verbose,
916 self.intermediate,
917 self.intermediate_snapshots,
918 pca_solution,
919 self.tsvd_transformer
920 )
921 if not save_pairs:
922 self.del_pairs()
File ~/venvs/topometry/lib/python3.11/site-packages/pacmap/pacmap.py:539, in pacmap(X, n_dims, pair_neighbors, pair_MN, pair_FP, lr, num_iters, Yinit, verbose, intermediate, inter_snapshots, pca_solution, tsvd)
536 intermediate_states = None
538 # Initialize the embedding
--> 539 if (Yinit is None or Yinit == "pca"):
540 if pca_solution:
541 Y = 0.01 * X[:, :n_dims]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
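The last frame shows pacmap comparing the user-supplied init array against the string "pca" (`Yinit is None or Yinit == "pca"`), and using an element-wise NumPy comparison in a boolean context raises exactly this error. A minimal reproduction of the underlying NumPy behavior (not TopOMetry code):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# An element-wise comparison returns an array of booleans ...
print(a == a)  # [ True  True  True]

# ... so using it directly in an `if` is ambiguous and raises:
try:
    if a == a:
        pass
except ValueError as e:
    print("caught:", e)
```

Assuming the init handling is the culprit, letting PaCMAP use its default initialization (no array init), or pinning a pacmap version that guards this comparison, are possible workarounds.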
FYI, I started a PR to distribute this package through Conda Forge. Conda Forge works downstream of PyPI, so there is nothing that needs to be done here to support this. Please let me know if you'd like to be included as a maintainer on the Conda recipe - it is entirely optional and updating is mostly automated anyway.
Thanks for this neat work!
Hi @davisidarta,
Great work! I think this package will really help expand how we are using the KNN graph structures of single-cell datasets.
I was taking a closer look into the Diffusor class and found that you don't actually use the parameter X in Diffusor.transform. Does this mean that the data can only be self-transformed? Or am I perhaps missing something?
Best,
Parashar
First off, thank you so much for the great package and manuscript. It's awesome to see such important work in single-cell dimensionality reduction, unifying and comparing the DR methods.
My question is about evaluating embedding. In the documentation, you showed how to compare the embeddings based on PCA loss and geodesic spearman R. If I cluster the data and know the cluster labels I can also use metrics like Adjusted Rand Index and Adjusted Mutual Information to evaluate the clustering (and the embedding indirectly).
However, suppose I only have a binary or continuous column (like the expression of a gene) and want to see how well the embedding "explains" my column (the relationship can be highly non-linear), and to evaluate the different embedding methods/parameters based on that. What should I do? I could train an XGBoost model to predict my target and get R-squared/AUC using only the embedding, but then I would also have the model's hyperparameters to tune. Do you have any suggestions for this problem?
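One way to sidestep the tuning problem (a sketch, not an official TopOMetry recipe) is a fixed, nearly hyperparameter-free probe such as cross-validated k-NN regression on the embedding: the same fixed probe settings are applied to every embedding being compared, so score differences reflect the embeddings rather than the model. The data below is synthetic:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
emb = rng.normal(size=(300, 2))                            # stand-in embedding
target = emb[:, 0] ** 2 + rng.normal(scale=0.1, size=300)  # nonlinear target

# Same fixed probe for every embedding: cross-validated R^2 of 10-NN regression.
score = cross_val_score(KNeighborsRegressor(n_neighbors=10),
                        emb, target, cv=5, scoring="r2").mean()
print(round(score, 3))
```

For a binary column, the analogous probe would be `KNeighborsClassifier` with `scoring="roc_auc"`. Because k-NN only depends on local neighborhoods, it also captures highly non-linear relationships.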
Sorry if my questions are naive, but they are for my better understanding.
1- In a classical workflow, if I use TopOMetry, are the eigencomponents considered the dimensionality reduction step?
2- For my adata.X object I observed an eigengap around 120, so does the projection used in the model use this number of eigenvectors?
3- Also, for comparison of results, shouldn't I use 120 PCs (equivalent to 120 eigenvectors), use those to compute neighbors, and plot a UMAP to compare the results with tg.ProjectionDict['MAP of bw_adaptive from msDM with bw_adaptive']?
4- One cell type has higher intrinsic dimensionality (i.d.) estimates than the others. How should I interpret this, and could it just be an effect of that cell type's proportion being small? (A low number of cells of that cell type should lead to a high i.d., right? And should that mean they don't cluster very well together?)
Finally, I did this comparison using the same number of PCs to generate the kNN graph, once with the UMAP from Scanpy and once with topoMAP. Here are both projections for my PBMC dataset, which contains different individuals under different conditions (non-stimulated and stimulated with COVID). How would you interpret the different visualizations, specifically in the B cell cluster in pink?
tg = tp.TopOGraph(n_eigs=119, n_jobs=-1, verbosity=0)
tg.run_models(adata.X, kernels=['bw_adaptive'],
eigenmap_methods=['msDM'],
projections=['MAP','UMAP'])
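On question 3 above: a dimensionality-matched baseline is one reasonable way to compare, i.e. build the kNN graph and UMAP from the same number of PCs as eigenvectors. A minimal sketch with scikit-learn (the random data and the Scanpy calls in the comments are illustrative assumptions, not the actual workflow):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))  # stand-in for adata.X
n_comps = 119                     # matched to n_eigs used above

X_pca = PCA(n_components=n_comps).fit_transform(X)
print(X_pca.shape)  # (1000, 119)

# In Scanpy the matched baseline would then be, e.g.:
#   adata.obsm['X_pca'] = X_pca
#   sc.pp.neighbors(adata, use_rep='X_pca')
#   sc.tl.umap(adata)
```

With both pipelines using the same number of components and the same kNN settings, the remaining differences between the UMAP and topoMAP layouts come from the basis (PCA vs. msDM eigenvectors) and the graph kernel, which is what the comparison is meant to isolate.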