
geosketch's People

Contributors

brianhie, hhcho, prete


geosketch's Issues

Pandas DataFrame as input

Hi. Is it possible to run geosketch on a DataFrame object? I need to preserve the row and column names of my count matrix.
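A common workaround (a minimal sketch, assuming gs() returns plain 0-based integer positions): hand gs() the DataFrame's underlying array, then index back with .iloc so row and column names survive. The gs() call is left as a comment so the snippet is self-contained; the indices below are stand-ins.

```python
import numpy as np
import pandas as pd

# toy PCA matrix with named rows (cells) and columns
df = pd.DataFrame(np.random.RandomState(0).randn(6, 3),
                  index=[f"cell{i}" for i in range(6)],
                  columns=["PC1", "PC2", "PC3"])

# gs() expects a plain array, so pass df.to_numpy();
# in practice: sketch_index = geosketch.gs(df.to_numpy(), N, replace=False)
sketch_index = [0, 2, 5]  # stand-in for the 0-based indices gs() returns

sketch = df.iloc[sketch_index]  # .iloc keeps the row and column names
```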

Not really the rarest cell types?

I am looking at the data underlying Figure 2C, and I noticed that for the adult mouse brain data set, the macrophages are not actually the rarest cell type; they are something like the third-rarest. Is there a reason you focused on this particular cell type for that data set, rather than the mitotic cells?

Determining optimal sampling rate

Hi,

I have been playing around with geosketch with the aim of speeding up the creation of large single cell reference panoramas with scanorama.

I am using the first 20 components from PCA (in the Seurat implementation after SCTransform) to define the transcriptomic space and so far I'm quite happy with the results.
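The setup described above can be sketched in Python (scikit-learn's PCA stands in for the Seurat/SCTransform pipeline; the matrix is random stand-in data, and the gs() call is left as a comment so the snippet runs without geosketch):

```python
import numpy as np
from sklearn.decomposition import PCA

# toy counts matrix: 500 cells x 100 genes (stand-in for the real data)
rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(500, 100)).astype(float)

# the first 20 principal components define the space handed to the sketcher
X_pca = PCA(n_components=20, random_state=0).fit_transform(X)
# in practice: sketch_index = geosketch.gs(X_pca, N, replace=False)
```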

Here you can see a test I'm doing on the Lake et al. human snRNAseq brain dataset. It's a small set compared to the Saunders et al. mouse brain, but it's one of the datasets I will integrate. It may not make sense to sketch an already small dataset, but I just wanted to practice on it.

(animations omitted: movie_umap, movie_barplot)

First question:
Is there an advantage to running geosketch on a relatively high-dimensional reduction, such as the first 20 PCA components, rather than on the 2 dimensions of the UMAP reduction (as returned by the Seurat implementation)?

Second question:
Eventually I am going to integrate a sizeable number of datasets, and I was wondering how to find the "sweet spot" in the trade-off between a small sampling rate (fewer cells -> faster downstream batch correction/stitching) and representativeness (enough cells per cell type, enough "marker" genes expressed at comparable levels, reproducibility of clusters, etc.).

I reckon that there are some possible empirical solutions, each of them making a bunch of assumptions:

  1. determine a sampling rate so that the smallest cluster still retains a minimum, arbitrary number of cells

  2. choice of an arbitrary partial Hausdorff[1] cutoff: I moved along sampling rates from 10% to 99% using 10 iterations with q = 1e-3 (set higher than your suggestion to account for the smaller number of cells); the partial HD results are here:

(plot omitted: Lake_partialHD_10reps)

I was somewhat expecting the curve to flatten when approaching higher sampling percentages, but I guess this does not happen because of the small sample size?

  3. same as in 2), but using BAMI or any other measure of consistency between clusterings, as in your paper

  4. same as in 2), but using consistency of marker finding between the full and sketched datasets (may be biased towards high sampling rates)

  5. same as in 2), but using consistency of per-cluster and per-dataset dispersion estimation (may be biased towards high sampling rates)

Do you have any other suggestions? I know you applied many metrics to determine how low one could go in the sampling rate without losing too much information, but the performance of the sketching procedure seems to vary across datasets and sampling rates (fig. 3D of your paper), perhaps according to sample complexity and/or sample size.
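The first empirical solution above (keep the smallest cluster above an arbitrary cell-count floor) can be sketched as follows; uniform random sampling stands in for gs() so the snippet runs without geosketch, and the labels, rates, and threshold are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.RandomState(0)
# toy cluster labels: one abundant, one medium, one rare population
labels = rng.choice(["neuron", "astro", "macrophage"], size=2000, p=[0.7, 0.25, 0.05])

def smallest_ok_rate(labels, rates, min_cells, rng):
    """Return the smallest sampling rate whose (stand-in) sketch keeps every
    cluster present with at least `min_cells` cells."""
    n_clusters = len(np.unique(labels))
    for rate in sorted(rates):
        # stand-in for: idx = geosketch.gs(X_pca, int(rate * len(labels)))
        idx = rng.choice(len(labels), int(rate * len(labels)), replace=False)
        uniq, counts = np.unique(labels[idx], return_counts=True)
        if len(uniq) == n_clusters and counts.min() >= min_cells:
            return rate
    return 1.0

rate = smallest_ok_rate(labels, [0.05, 0.1, 0.2, 0.5], min_cells=20, rng=rng)
```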

Many thanks in advance and thanks for the very nice tool.

[1]: This is my R implementation of the partial HD, adapted from the pracma::hausdorff_dist function:

partial_hd <- function(X, S, q)
{
    stopifnot(is.numeric(X), is.numeric(S), is.numeric(q))
    if (is.vector(X))
        X <- matrix(X, ncol = 1)
    if (is.vector(S))
        S <- matrix(S, ncol = 1)
    if (ncol(X) != ncol(S))
        stop("'X' and 'S' must have the same number of columns.")
    if (q > 1 || q < 0)
        stop("'q' must be between 0 and 1 inclusive.")
    # rank of the nearest-neighbour distance to report (1-based);
    # q = 0 recovers the classic directed Hausdorff distance
    K <- max(1, floor((1 - q) * nrow(X)))
    D <- pracma::distmat(X, S)
    # distance of each point in X to its nearest point in S, sorted ascending
    min_XS <- sort(apply(D, 1, min))
    min_XS[K]
}

Note on nomenclature: this computes the distance from the full set X to the sketch S, whereas in the Huttenlocher et al. paper the partial HD is calculated from S to X. This should not make a big difference, although the opposite direction may need a different value of q when |X| >> |S|.
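For Python users, an equivalent sketch of the same quantity using SciPy (mirroring the R function's 1-based rank convention; SciPy and NumPy are assumed available):

```python
import numpy as np
from scipy.spatial.distance import cdist

def partial_hd(X, S, q):
    """Directed partial Hausdorff distance from X to S: the K-th smallest
    nearest-neighbour distance, with K = floor((1 - q) * |X|); q = 0
    recovers the classic directed Hausdorff distance."""
    nn = cdist(X, S).min(axis=1)  # each X point's distance to its nearest S point
    K = max(1, int(np.floor((1 - q) * len(X))))
    return np.sort(nn)[K - 1]

X = np.array([[0.0], [1.0], [2.0]])
S = np.array([[0.0]])
d_full = partial_hd(X, S, 0)     # classic directed Hausdorff: worst point counts
d_part = partial_hd(X, S, 0.34)  # ignores the worst ~34% of points
```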

Index issue of the R wrapper

Hi,

I think this is a great method, and I am trying to use it in my project.

However, when I ran the code from the example.R file, I noticed that sketch.indices can start at 0 and end at 999 (if there are 1,000 cells in total). I guess this is because it uses reticulate to call the Python code, which is 0-indexed.

I wonder whether this is intentional. Many R users will assume the indices start at 1 by default, and if they use these indices directly to index matrices, completely different cells will be chosen.

How can geosketch results be made reproducible?

Hey there,

I'm trying to generate identical results with geosketch, but the output seems hard to reproduce. I used set.seed() but it did not work. How can I make the results reproducible?

Joey
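One likely explanation (an assumption, not confirmed by the source: geosketch appears to draw from NumPy's global RNG): R's set.seed() seeds R's RNG, not Python's, so it has no effect on code called through reticulate. Seeding NumPy before each call should make the draws repeatable, as this self-contained sketch of the mechanism shows:

```python
import numpy as np

# seed NumPy's global RNG before each sampling call
np.random.seed(0)
a = np.random.choice(1000, 5, replace=False)

np.random.seed(0)  # re-seed with the same value ...
b = np.random.choice(1000, 5, replace=False)  # ... and get the same draw
```

From R, the equivalent would be something like reticulate::py_run_string("import numpy; numpy.random.seed(0)") before calling gs() (hypothetical, untested).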

FutureWarning: The sklearn.cluster.k_means_ module is deprecated in version 0.22 and will be removed in version 0.24.

/usr/lib/python3.8/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.cluster.k_means_ module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.cluster. Anything that cannot be imported from sklearn.cluster is now part of the private API.
  warnings.warn(message, FutureWarning)

Affects https://github.com/brianhie/geosketch/blob/ec93a46654adc161e40ba5329c232b17a92adf68/geosketch/kmeanspp.py

I would fix it, but since you use import *, it's hard to know exactly what's coming from that module.
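One possible direction for a fix (an assumption, since it depends on which names kmeanspp.py actually pulls from the deprecated module): replace the wildcard import with explicit imports from the public sklearn.cluster namespace. For k-means++ seeding specifically, kmeans_plusplus is public as of scikit-learn 0.24:

```python
import numpy as np
# instead of: from sklearn.cluster.k_means_ import *   (deprecated module)
from sklearn.cluster import kmeans_plusplus  # public API since scikit-learn 0.24

# toy data: 50 points in 2D
X = np.random.RandomState(0).randn(50, 2)

# k-means++ initialization: returns the chosen centers and their row indices
centers, indices = kmeans_plusplus(X, n_clusters=3, random_state=0)
```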

Missing MatrixMarket headers

This isn't an issue with the code per se, just an FYI about a minor problem with the associated data. All of the mtx files in the mouse_brain/dropviz subdirectories are missing the required MatrixMarket header. As a result, scipy fails to parse them:

cerebellum = np.array(scipy.io.mmread("../../../../data/geosketch/mouse_brain/dropviz/Cerebellum_ALT/matrix.mtx").toarray())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/wnoble/anaconda2/lib/python2.7/site-packages/scipy/io/mmio.py", line 75, in mmread
    return MMFile().read(source)
  File "/Users/wnoble/anaconda2/lib/python2.7/site-packages/scipy/io/mmio.py", line 416, in read
    self._parse_header(stream)
  File "/Users/wnoble/anaconda2/lib/python2.7/site-packages/scipy/io/mmio.py", line 480, in _parse_header
    self.__class__.info(stream)
  File "/Users/wnoble/anaconda2/lib/python2.7/site-packages/scipy/io/mmio.py", line 233, in info
    [asstr(part.strip()) for part in line.split()]
ValueError: need more than 3 values to unpack
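A minimal reproduction and workaround sketch (assumption: prepending the standard banner line is sufficient for these files; the tiny matrix below is a stand-in for the real data):

```python
import io
import scipy.io

# coordinate-format body without the required "%%MatrixMarket" banner line
body = b"3 3 2\n1 1 5.0\n2 2 7.0\n"

try:
    scipy.io.mmread(io.BytesIO(body))
    parsed_without_header = True
except Exception:  # scipy rejects the headerless file
    parsed_without_header = False

# prepend the banner and the file parses normally
banner = b"%%MatrixMarket matrix coordinate real general\n"
m = scipy.io.mmread(io.BytesIO(banner + body)).toarray()
```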

Label count mismatch

I am seeing a mismatch between the count of the rarest cell type reported in the paper and the one in the data tar file. The paper says

"In particular, we focused on 28 293T cells (0.66% of the total number of cells in the dataset) in the 293T/Jurkat mixture, 262 dendritic cells (0.38%) in the PBMC dataset, 1,695 macrophages (0.25%) among the adult mouse brain cells, and 2,777 ependymal cells (0.60%) among the mouse CNS cells."

I changed into the data/geosketch/cell_labels directory and had no problem verifying the numbers for the first two data sets:

bash-4.1$ grep 293t jurkat_293t_99_1_clusters.txt | wc -l
28
bash-4.1$ grep Dendritic pbmc_68k_cluster.txt | wc -l
262

However, for the last two data sets, the numbers do not agree. For the mouse brain (Saunders), we get 1866 macrophages instead of the expected 1695:

bash-4.1$ grep Macrophage mouse_brain_cluster.txt | wc -l
1866

For the mouse CNS cells (Zeisel) we get 2974 instead of the expected 2777:

bash-4.1$ grep Ependymal zeisel_cluster.txt | wc -l
2974

Are the numbers in the paper incorrect?

Bug: non-random output - the same observation is sampled repeatedly in subsequent calls to gs() with the same sketch size

Expected behavior
Across subsequent runs, I would expect different random seeds to be used, making it rare to see the same cell in multiple samples from adata.

Observed behavior
For several sketch sizes, the same index of adata is returned in every call to gs(). The index changes with the sketch size, but for a given sketch size it is always the same. In this dataset, every sketch of size N=4 includes observation 25419; similarly, every sketch of size N=10 includes observation 17479.

Steps to reproduce:

Load data

from geosketch import gs
import numpy as np
import scprep
import scanpy as sc
import os
download_path = os.path.expanduser("~/scratch")
download_path = os.path.join(download_path,'Wagner2018.processed.h5ad')
URL = 'https://ndownloader.figshare.com/files/25687247?private_link=f194ae7d6bcec9bd11a3'
scprep.io.download.download_url(URL, download_path)
adata = sc.read_h5ad(download_path)
sc.tl.pca(adata)

Draw samples of size four and count the number of times each index is observed:

n_obs_subsample = 4

indices = []
for i in range(10):
    indices.append(gs(adata.obsm['X_pca'], n_obs_subsample, replace=False))

indices = np.hstack(indices)
indices, counts = np.unique(indices, return_counts=True)

returns

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  3,  1,  1,  1,  1,  2,  1,  1, 10])

This is true even when explicitly setting different seeds (screenshots omitted).
