davisidarta / topometry


Systematically learn and evaluate manifolds from high-dimensional data

Home Page: https://topometry.readthedocs.io/en/latest/

License: MIT License

Python 100.00%
clustering data-science data-visualization dimensionality-reduction graph graph-layout hypothesis-generation laplace-beltrami machine-learning manifold-learning scikit-learn single-cell visualization

topometry's Introduction


Hi! I'm Davi

I develop tools to understand and interpret high-dimensional data, with a focus on single-cell omics.

  • I developed TopOMetry, a comprehensive framework for high-dimensional data analysis. TopOMetry learns similarity graphs, estimates the intrinsic dimensionality of the data, obtains latent dimensions using topological operators, clusters samples, and lays out topological graphs as two-dimensional visualizations. TopOMetry learns and evaluates dozens of possible visualizations so that users do not have to stick with any single pre-determined model (e.g., t-SNE or UMAP). It was designed to be compatible with a scikit-learn-centered workflow, as most of its classes and functions can be pipelined (a minimal usage sketch follows this list). The TopOMetry manuscript is freely available on bioRxiv.

  • I'm currently a postdoc at Ana Domingos' lab at the University of Oxford. We are working on generating and analyzing single-cell datasets from a variety of tissues relevant to obesity and metabolism to build updated comprehensive neuroanatomical maps with cellular resolution. These will serve as a foundation for new studies investigating cellular-specific therapeutic targets for obesity and its comorbidities.
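
For orientation, here is the minimal usage sketch referenced above. It is hedged: the data are synthetic, the parameter values are illustrative, and the calls mirror those that appear in the issues further down this page.

import numpy as np
import topo as tp

# Minimal TopOMetry sketch (synthetic data; illustrative parameters):
X = np.random.default_rng(0).normal(size=(500, 50))  # samples x features

tg = tp.TopOGraph(n_eigs=20, n_jobs=-1, verbosity=0)
tg.run_models(X,
              kernels=['bw_adaptive'],   # similarity-graph kernel(s)
              eigenmap_methods=['DM'],   # topological operator(s) for latent dimensions
              projections=['MAP'])       # graph-layout method(s)

# Results are stored in named dictionaries on the TopOGraph object:
emb = tg.ProjectionDict['MAP of bw_adaptive from DM with bw_adaptive']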

I'm always open to interesting conversations and enjoy getting involved in many projects. Feel free to reach me by email.

I tweet about medicine, neuroscience, computational biology, machine learning, and sometimes about my personal life.

topometry's People

Contributors

brunoloyolabarbosa, davisidarta


topometry's Issues

library integration

Hi, I have recently started to play with single-cell data analysis, and your bioRxiv paper and approach sound really interesting. I understand how PCA dimensionality reduction probably makes the wrong assumption about the topology of the underlying data.
In that sense, I was curious how you handle integration of multiple datasets. Traditionally, this integration is based on common variable PCs (eigenvectors), but those are selected assuming the same underlying topology. In your package, I can see that if we have biological replicates, we could select the same topology for two libraries and then select the most variable eigenvectors for integration. But what happens if the biological replicates use different library preps, which introduce some kind of batch effect? That batch effect would influence the selection of the best-fitting topology for the data, and could make it difficult, if not impossible, to integrate datasets that should have a shared underlying biology and composition.

My question is: how can we handle these scenarios with TopOMetry? How do we best select eigenvectors for integrating multiple datasets, and how do we prevent sampling methods from introducing batch artifacts into the model?
These may be naive questions, but I would like to understand your take on them.
Best regards,

Conda Forge build

FYI, I started a PR to distribute this package through Conda Forge. Conda Forge works downstream of PyPI, so there is nothing that needs to be done here to support this. Please let me know if you'd like to be included as a maintainer on the Conda recipe - it is entirely optional and updating is mostly automated anyway.

Thanks for this neat work!

When to consider using TopOMetry over UMAP?

I think some people (myself included) will come across this method being familiar with UMAP and classical dimensionality reduction techniques, but it might not be clear when to use UMAP vs. TopOMetry. Could you comment on this?

You might consider adding this to the docs and/or README, but feel free to ignore the suggestion as well.

PaCMAP generates an error

Using the same code with the MAP projection works fine:

projections=["MAP"]

tg = tp.TopOGraph(n_eigs=119, n_jobs=-1, verbosity=0)

tg.run_models(adata.X, kernels=['bw_adaptive'],
              eigenmap_methods=['msDM'],
              projections=["PaCMAP"])


ValueError Traceback (most recent call last)
Cell In[65], line 3
1 tg = tp.TopOGraph(n_eigs=119, n_jobs=-1, verbosity=0)
----> 3 tg.run_models(adata.X, kernels=['bw_adaptive'],
4 eigenmap_methods=['msDM'],
5 projections=["PaCMAP"])

File ~/venvs/topometry/lib/python3.11/site-packages/topo/topograph.py:1013, in TopOGraph.run_models(self, X, kernels, eigenmap_methods, projections)
1011 gc.collect()
1012 for projection in projections:
-> 1013 self.project(projection_method=projection)
1014 gc.collect()

File ~/venvs/topometry/lib/python3.11/site-packages/topo/topograph.py:925, in TopOGraph.project(self, n_components, init, projection_method, landmarks, landmark_method, n_neighbors, num_iters, **kwargs)
923 projection_key = projection_method + ' of ' + key
924 start = time.time()
--> 925 Y = Projector(n_components=n_components,
926 projection_method=projection_method,
927 metric=metric,
928 n_neighbors=self.graph_knn,
929 n_jobs=self.n_jobs,
930 landmarks=landmarks,
931 landmark_method=landmark_method,
932 num_iters=num_iters,
933 init=init_Y,
934 nbrs_backend=self.backend,
935 keep_estimator=False,
936 random_state=self.random_state,
937 verbose=self.layout_verbose).fit_transform(input, **kwargs)
938 end = time.time()
939 gc.collect()

File ~/venvs/topometry/lib/python3.11/site-packages/sklearn/utils/_set_output.py:157, in _wrap_method_output..wrapped(self, X, *args, **kwargs)
155 @wraps(f)
156 def wrapped(self, X, *args, **kwargs):
--> 157 data_to_wrap = f(self, X, *args, **kwargs)
158 if isinstance(data_to_wrap, tuple):
159 # only wrap the first output for cross decomposition
160 return_tuple = (
161 _wrap_data_with_container(method, data_to_wrap[0], X, self),
162 *data_to_wrap[1:],
163 )

File ~/venvs/topometry/lib/python3.11/site-packages/topo/layouts/projector.py:416, in Projector.fit_transform(self, X, **kwargs)
405 def fit_transform(self, X, **kwargs):
406 """
407 Calls the fit_transform method of the desired method.
408 If the desired method does not have a fit_transform method, calls the results from the fit method.
(...)
413 Projection results
414 """
--> 416 self.fit(X, **kwargs)
417 return self.Y_

File ~/venvs/topometry/lib/python3.11/site-packages/topo/layouts/projector.py:318, in Projector.fit(self, X, **kwargs)
315 metric = self.metric
316 self.estimator_ = pacmap.PaCMAP(n_components=self.n_components, n_neighbors=self.n_neighbors,
317 apply_pca=False, distance=metric, num_iters=self.num_iters, verbose=self.verbose, **kwargs)
--> 318 self.Y_ = self.estimator_.fit_transform(X=X, init=self.init_Y_)
320 elif self.projection_method == 'TriMAP':
321 try:

File ~/venvs/topometry/lib/python3.11/site-packages/pacmap/pacmap.py:943, in PaCMAP.fit_transform(self, X, init, save_pairs)
925 def fit_transform(self, X, init=None, save_pairs=True):
926 '''Projects a high dimensional dataset into a low-dimensional embedding and return the embedding.
927
928 Parameters
(...)
940 Whether to save the pairs that are sampled from the dataset. Useful for reproducing results.
941 '''
--> 943 self.fit(X, init, save_pairs)
944 if self.intermediate:
945 return self.intermediate_states

File ~/venvs/topometry/lib/python3.11/site-packages/pacmap/pacmap.py:906, in PaCMAP.fit(self, X, init, save_pairs)
904 self.num_dimensions = X.shape[1]
905 # Initialize and Optimize the embedding
--> 906 self.embedding_, self.intermediate_states, self.pair_neighbors, self.pair_MN, self.pair_FP = pacmap(
907 X,
908 self.n_components,
909 self.pair_neighbors,
910 self.pair_MN,
911 self.pair_FP,
912 self.lr,
913 self.num_iters,
914 init,
915 self.verbose,
916 self.intermediate,
917 self.intermediate_snapshots,
918 pca_solution,
919 self.tsvd_transformer
920 )
921 if not save_pairs:
922 self.del_pairs()

File ~/venvs/topometry/lib/python3.11/site-packages/pacmap/pacmap.py:539, in pacmap(X, n_dims, pair_neighbors, pair_MN, pair_FP, lr, num_iters, Yinit, verbose, intermediate, inter_snapshots, pca_solution, tsvd)
536 intermediate_states = None
538 # Initialize the embedding
--> 539 if (Yinit is None or Yinit == "pca"):
540 if pca_solution:
541 Y = 0.01 * X[:, :n_dims]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
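
Judging from the traceback, Projector hands pacmap an init array, and pacmap then evaluates `Yinit is None or Yinit == "pca"`. On recent NumPy versions, comparing an array to a string is element-wise, so the `if` sees a boolean array and raises the ValueError above. A minimal sketch of a guard that avoids the ambiguity (an illustration of the failure mode, not the library's actual fix):

import numpy as np

# Stand-in for the initialization array that Projector passes to pacmap:
Yinit = np.zeros((10, 2))

# Checking the string case first sidesteps the ambiguous array comparison:
if Yinit is None or (isinstance(Yinit, str) and Yinit == "pca"):
    print("PCA initialization requested")
else:
    print("explicit initialization array supplied")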

Evaluating an embedding by how well it explains a continuous/binary column

First off, thank you so much for the great package and manuscript. It's awesome to see such important work in dimensionality reduction for single-cell data, unifying and comparing DR methods.

My question is about evaluating embeddings. In the documentation, you showed how to compare embeddings based on PCA loss and geodesic Spearman R. If I cluster the data and know the cluster labels, I can also use metrics like Adjusted Rand Index and Adjusted Mutual Information to evaluate the clustering (and the embedding, indirectly).

However, if I only have a binary or continuous column (like the expression of a gene) and want to see how well the embedding "explains" my column (the relationship can be highly non-linear), and evaluate the different embedding methods/parameters based on that, what should I do? I can train an XGBoost model to predict my target and get R-squared/AUC using only the embedding, but then I'll also have the model's hyperparameters to tune. Do you have any suggestions for this problem?
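
One low-tuning possibility (a suggestion, not a documented TopOMetry feature): score each embedding with a cross-validated k-nearest-neighbors model, which captures non-linear structure and has essentially one hyperparameter. A minimal sketch with synthetic stand-ins:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# The arrays below are synthetic stand-ins for a real embedding (e.g. an
# entry of tg.ProjectionDict) and a continuous gene-expression column.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(500, 2))
target = embedding[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

# Cross-validated R-squared of a k-NN regressor on the embedding alone:
scores = cross_val_score(KNeighborsRegressor(n_neighbors=15),
                         embedding, target, cv=5, scoring="r2")
print(scores.mean())  # compare this across embeddings / parameter settings

For a binary column, the same idea works with KNeighborsClassifier and scoring="roc_auc".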

PaCMAP error

I'm getting an error when trying to run this bit of code in Colab:

import topo as tp

tg = tp.TopOGraph()
tg.run_layouts(emb, n_components=2, bases=['diffusion', 'fuzzy'], graphs=['diff', 'fuzzy'], layouts=['MAP', 'PaCMAP'])

The error is:
TypeError: PaCMAP.__init__() got an unexpected keyword argument 'n_dims'

I can run PaCMAP outside of topo just fine, so this looks like an easy fix.
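
This looks consistent with an API-version mismatch: some pacmap releases accept `n_dims` while later ones use `n_components`. A defensive sketch (the renaming is my assumption about the cause, not confirmed by the maintainers):

import inspect

import pacmap

# Check which keyword the installed pacmap expects before constructing it:
params = inspect.signature(pacmap.PaCMAP.__init__).parameters
kw = "n_components" if "n_components" in params else "n_dims"
reducer = pacmap.PaCMAP(**{kw: 2})
print(f"installed pacmap expects '{kw}'")

Pinning pacmap to the version topo expects (or upgrading topo) would likely resolve it.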

MAP projection looks odd with a high number of cells

Sorry for bothering you again, but after testing on my entire dataset (around 3/4 million cells) rather than just one sample, I observe the MAP projection below and am not sure what the issue could be. The same exact workflow on just one library (11k cells), or on the same AnnData object subset to a small number of cells, produces a normal-looking projection.
Here are the code and projection:

adata = sc.read_h5ad("/pasteur/zeus/projets/p02/LabExMI/singleCell/V3/scRNA_NS_IAV_COV/results/merged_object/adata_scvi.h5ad")
adata.X = adata.X.toarray()
sc.pp.scale(adata, max_value=10)
adata = adata[:, adata.var.highly_variable]

tg = tp.TopOGraph(n_eigs=150, n_jobs=-1, verbosity=0)
tg.run_models(adata.X, kernels=['bw_adaptive'],
              eigenmap_methods=['DM'],
              projections=['MAP'])
tg
TopOGraph object with 727581 samples and 12855 observations and:
 . Base Kernels: 
    bw_adaptive - .BaseKernelDict['bw_adaptive']
 . Eigenbases: 
    DM with bw_adaptive - .EigenbasisDict['DM with bw_adaptive']
 . Graph Kernels: 
    bw_adaptive from DM with bw_adaptive - .GraphKernelDict['bw_adaptive from DM with bw_adaptive']
 . Projections: 
    MAP of bw_adaptive from DM with bw_adaptive - .ProjectionDict['MAP of bw_adaptive from DM with bw_adaptive'] 
 Active base kernel  -  .base_kernel 
 Active eigenbasis  -  .eigenbasis 
 Active graph kernel  -  .graph_kernel


adata.obsm['X_topoMAP'] = tg.ProjectionDict['MAP of bw_adaptive from DM with bw_adaptive']

[images: MAP projection of the full ~730k-cell dataset (MAP_BIG) vs. a single 11k-cell library (MAP)]
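
One hedged way to narrow this down (assumes the `adata` and imports from the snippet above): rerun the identical workflow on a random subsample, to test whether the distortion only appears at full scale or already at, say, 50k cells.

import numpy as np
import topo as tp

# Sanity-check sketch: same workflow on a random 50k-cell subsample.
rng = np.random.default_rng(0)
idx = rng.choice(adata.n_obs, size=50_000, replace=False)
sub = adata[idx].copy()

tg_sub = tp.TopOGraph(n_eigs=150, n_jobs=-1, verbosity=0)
tg_sub.run_models(sub.X, kernels=['bw_adaptive'],
                  eigenmap_methods=['DM'],
                  projections=['MAP'])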

module 'topo' has no attribute 'sc'

I am not sure why I get this error. Thanks in advance.

tg = tp.TopOGraph(base_knn=20,
                  n_eigs=100,  # set this to the largest estimate!
                  n_jobs=-1,
                  verbosity=0)

# Run a TopOMetry model for PBMC3K
adata = tp.sc.topological_workflow(
    adata,                   # the AnnData object
    tg,                      # the TopOGraph object
    kernels=['bw_adaptive'], # the kernel(s) to use
    eigenmap_methods=['DM'], # the eigenmap method(s) to use
    projections=['MAP'],     # the projection(s) to use
    resolution=0.8           # the Leiden clustering resolution
)
----> 7 adata = tp.sc.topological_workflow(
      8     adata,                  # the anndata object
      9     tg,                # the TopOGraph object
     10     kernels=['bw_adaptive'],# the kernel(s) to use
     11     eigenmap_methods=['DM'],# the eigenmap method(s) to use
     12     projections=['MAP'],    # the projection(s) to use
     13     resolution=0.8          # the Leiden clustering resolution
     14 )

AttributeError: module 'topo' has no attribute 'sc'
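
A plausible cause, offered as an assumption rather than a confirmed diagnosis: `topo.sc` appears to wrap scanpy-based utilities (it takes an AnnData object), so it may only be importable when scanpy is installed in the environment. A quick check:

import importlib
import importlib.util

# Verify the optional dependency before importing the submodule explicitly:
if importlib.util.find_spec("scanpy") is None:
    raise ImportError("scanpy not found; try `pip install scanpy`")

sc_utils = importlib.import_module("topo.sc")  # explicit import, in case
# `import topo` alone does not attach the `sc` attribute
print(hasattr(sc_utils, "topological_workflow"))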

trustworthiness function missing

Hey Davi,

Thanks for the great tool. I just started diving into my data using the evaluation functions. I just wanted to point out that the trustworthiness function in eval/local_scores.py is missing and can't be loaded. Was it removed in a previous version? It's also not included in the file here on GitHub.

Best,
Daniel
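
Until it reappears, scikit-learn's implementation of the same score can serve as a stand-in (a sketch with synthetic data; the real inputs would be your data matrix and one of the learned embeddings):

import numpy as np
from sklearn.manifold import trustworthiness

# Trustworthiness measures how well local neighborhoods of X are preserved
# in the embedding (1.0 = perfectly preserved).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                          # high-dimensional data
emb = X[:, :2] + rng.normal(scale=0.01, size=(200, 2))  # toy embedding
print(trustworthiness(X, emb, n_neighbors=10))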

Comparing classic UMAP and TopOMetry

Sorry if my questions are naive, but they are for my better understanding.

1 - In a classical workflow, if I use TopOMetry, are the eigencomponents here considered the dimensionality reduction method?

2 - For my adata.X object I observed an eigengap around 120, so does the projection used in the model use this number of eigenvectors to do the projection?

3 - Also, for comparison of results, shouldn't I use 120 PCs (equivalent to 120 eigenvectors) and use those to compute neighbors and plot a UMAP, to compare those results with tg.ProjectionDict['MAP of bw_adaptive from msDM with bw_adaptive']?

4 - If one cell type has higher intrinsic dimensionality (i.d.) estimates than the others, how should I interpret this, and could it just be an effect of that cell type's proportion being small? (A low number of cells of that cell type should lead to a high i.d. estimate, right? And should that mean they don't cluster very well together?)

Finally, I did this comparison using the same number of PCs to generate the kNN graph and then the UMAP with Scanpy, and another time with topoMAP.

Here are both projections for my PBMC dataset, which contains different individuals under different conditions (non-stimulated and stimulated with COVID). How would you interpret the different visualizations, specifically in the B cell cluster in pink?

tg = tp.TopOGraph(n_eigs=119, n_jobs=-1, verbosity=0)

tg.run_models(adata.X, kernels=['bw_adaptive'],
              eigenmap_methods=['msDM'],
              projections=['MAP', 'UMAP'])
[images: Scanpy UMAP vs. topoMAP projections of the PBMC dataset]
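
For reference, this is roughly the Scanpy side of the matched comparison described in point 3. It is a hedged sketch: it assumes a preprocessed `adata`, uses 120 components to match the eigengap estimate, and the coloring column is hypothetical.

import scanpy as sc

# Matched PCA/UMAP baseline (sketch; assumes `adata` is already normalized
# and scaled, and that 120 matches the eigengap estimate discussed above):
sc.pp.pca(adata, n_comps=120)
sc.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
sc.tl.umap(adata)
sc.pl.umap(adata, color='cell_type')  # 'cell_type' is a hypothetical column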

Input parameter `X` in Diffusor.transform

Hi @davisidarta,

Great work! I think this package will really help expand how we are using the KNN graph structures of single-cell datasets.

I was taking a closer look at the Diffusor class and found that you don't actually use the parameter X in Diffusor.transform. Does this mean that the data can only be self-transformed? Or am I perhaps missing something?

Best,
Parashar
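
For context on the pattern the question describes, here is an illustrative contrast (a hypothetical toy class, not TopOMetry's actual API): when `transform` ignores its `X`, the estimator can only return results for the data it was fitted on, i.e. it is self-transform only.

import numpy as np

class SelfTransformOnly:
    """Toy estimator whose transform ignores X (hypothetical example)."""

    def fit(self, X):
        self.embedding_ = X[:, :2]  # everything is derived from the fit data
        return self

    def transform(self, X=None):    # X accepted for API symmetry but unused
        return self.embedding_

est = SelfTransformOnly().fit(np.random.default_rng(0).normal(size=(10, 5)))
print(est.transform().shape)        # (10, 2), regardless of any new X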
