Giter VIP home page Giter VIP logo

fastpg's Introduction

Introduction to FastPG

Fast phenograph-like clustering of millions of items with scores of features.

See our preprint article introducing FastPG:

Thomas Bodenheimer, Mahantesh Halappanavar, Stuart Jefferys, Ryan Gibson, Siyao Liu, Peter J Mucha, Natalie Stanley, Joel S Parker, Sara R Selitsky. FastPG: Fast clustering of millions of single cells. bioRxiv 2020.06.19.159749; doi: https://doi.org/10.1101/2020.06.19.159749

License

This package is licensed under the MIT license, except for the Grappolo C++ library by Mahantesh Halappanavar ([email protected]) which is included and licensed under the BSD-3 license as described in the headings of the relevant *.cpp files. Some alterations to the original Grappolo source have been made to support installation and use within an R package. The original Grappolo C++ library is available at https://hpc.pnl.gov/people/hala/grappolo.html

Installation

You can install locally from GitHub, or use the Docker container from DockerHub:

Local install

Before installing, you must have R, of course, but also:

  • If you are on Windows you will need the version of Rtools appropriate for the version of R (and need to set the path correctly), see https://cran.r-project.org/bin/windows/Rtools/. If you upgraded from R 3 to R 4 and have not already done so, you should uninstall the old and install the new Rtools.

  • If you are on Mac OS X, you will need the x-code command line tools and probably need to install the OpenMP library, which Apple no longer distributes as part of x-code. One way to do that is to install the data.table package as described here: https://firas.io/post/data.table_openmp/. Afterwards, FastPG and any other programs that use OpenMP should work.

You can install the FastPG package from GitHub using the Bioconductor installation manager. If you don't have Bioconductor you need to install the BiocManager and remotes packages from CRAN. Then you can install using:

# Requires the CRAN packages "remotes" and "BiocManager" to be installed.
BiocManager::install("sararselitsky/FastPG")

To install the latest code from a branch or a specific tagged version, append "@" to the repository string used above, e.g BiocManager::install("sararselitsky/FastPG@dev") or BiocManager::install("sararselitsky/[email protected]")

Use a pre-built Docker container

Another way to use FastPG is via a Docker container, jefferys/fastpg:latest, that is already set up and has the package pre-installed. You can just run it interactively, using the R command line inside it. To use the pre-built FastPG container from DockerHub you should be running on a 64 bit Intel/AMD compatible machine.

docker pull jefferys/fastpg:latest
docker run -it --rm -v $PWD:$PWD -w $PWD jefferys/fastpg:latest
R
   # or (singularity < 3.0)
singularity pull docker://jefferys/fastpg:latest
singularity shell -B $PWD --pwd $PWD -C fastpg-latest.simg
R
   # or (singularity 3.0+)
singularity pull docker://jefferys/fastpg:latest
singularity shell -B $PWD --pwd $PWD -C fastpg_latest.sif

Note that you should consider this container version like an "application" and not an environment. You may have problems installing additional packages into it. To do that, see Extending the Docker container.

Building your own Docker container

If you want to build your own container instead of pulling a pre-build one, you can use the Dockerfile included in the repository in the Docker/ directory as a guide. The build.sh file in the same directory automatically builds and tags the container with the name and tags used by the pre-built container at DockerHub, you should change the tags by editing the parameters in the build.sh file, or by manually building it and tagging it yourself.

git clone --single-branch https://github.com/sararselitsky/FastPG.git
cd FastPG/Docker
# Edit Docker tags in build.sh for your use
./build.sh --no-cache

You can just use this with Docker as above, but you can't just use a local Docker container with singularity versions less than 3.0. You have to push the container to some Docker registry before you can run it. With 3.0+ you can pull a local container directly by using docker-daemon:// instead of docker:// to get a local container.

Extending the Docker container

If you want to add additional things to the container, you should build your own. You can extend the existing container either by editing the provided Dockerfile or by using the existing container as the FROM that your own Docker container is based on.

Clustering with FastPG

Clustering is as simple as:

clusters <- FastPG::fastCluster( data, k, num_threads )

The fastCluster() function takes a number of additional tuning parameters, but those have reasonable defaults.

The data parameter

The main input is the numeric data to cluster as a matrix, where rows are elements to cluster and columns are the features that make elements similar or different. Any data matrix will do, for this example we extract a 265,627 x 32 numeric data matrix from the GitHub-published clustering benchmark mass cytometry data set, sourced from the phenograph paper [@Levine-2015].

url <- "https://github.com/lmweber/benchmark-data-Levine-32-dim/raw/master/data/Levine_32dim.fcs"
file <- "Levine_32dim.fcs"
download.file( url, file, mode="wb") # This downloads a 41.5 MB binary file
dataColumns <- c( 5:36 ) # extract only the data columns, whatever they are
data <-  flowCore::exprs( flowCore::read.FCS( file, truncate_max_range= FALSE ))
data <- data[ , dataColumns ]

The k parameter

To cluster a data set, a local neighborhood size needs to be specified as a parameter.

k <- 30

The num_threads parameter

The number of cpus to use should be specified, it defaults to 1. However, it will only limit the k nearest neighbors part of the clustering (see Internal algorithm). The rest of the clustering will use all available cpus regardless of what this is set to.

num_threads <- 4

Results

fastCluster() returns a list with two elements

  • $modularity = The modularity of the network created from the overlapping nearest neighbor graphs.
  • $communities = An integer vector where the nth element is the nth element from the input data. The value is the cluster that each input element has been assigned to. table(clusters$communities) shows a count by cluster id. Note that these id's are arbitrary and not deterministic.

Caution: -1 indicates a point that was not clustered; each can be considered their own cluster, even though they are all labeled “-1”. It is probable that you will not have any singleton clusters, but they can occur.

Internal algorithm

FastPG utilizes the same three main steps as the phenograph algorithm [@Levine-2015, @Chen-2016], but uses fast, parallel implementations.

  • The k nearest neighbors determining step is implemented using hierarchical navigable small world graphs [@Malkov-2016] via the RcppHNSW library.
  • The nearest-neighbor distances are generated using an included parallel Jaccard metric function.
  • Clustering is implemented as community detection in the graph formed from the overlapping "k best friends" for each element. This is done using a parallel Louvain algorithm as implemented by Grappolo [@Lu-2015]. Code for Grappolo has been included within this package; the standalone application with additional functionality is available for download as described in the license section above.

Calling fastCluster() is equivalent to and is essentially implemented as the following sequence of commands:

# Approximate k nearest neighbors
all_knn <- RcppHNSW::hnsw_knn( data, k= k, distance= 'l2',
                               n_threads= num_threads )

ind <- all_knn$idx

# Parallel Jaccard metric
links <- FastPG::rcpp_parallel_jce(ind)
links <- FastPG::dedup_links(links)

# Parallel Louvain clustering
clusters <- FastPG::parallel_louvain( links )

Note that RcppHNSW::hnsw_knn() and FastPG::parallel_louvain( links ) have numerous additional parameters; only the parameters used that are different from the defaults are shown above. The FastPG::fastCluster() wrapper allows setting all applicable parameters. See the function documentation for additional details.

References

Chen, Hao, Mai Chan Lau, Michael Thomas Wong, Evan W Newell, Michael Poidinger, and Jinmiao Chen. 2016. “Cytofkit: A Bioconductor Package for an Integrated Mass Cytometry Data Analysis Pipeline.” PLoS Comput Biol 12 (9).

Levine, Jacob H, Erin F Simonds, Sean C Bendall, Kara L Davis, El-ad D Amir, Michelle D Tadmor, Oren Litvin, et al. 2015. “Data-Driven Phenotypic Dissection of Aml Reveals Progenitor-Like Cells That Correlate with Prognosis.” Cell 162 (1): 184–97. https://doi.org/10.1016/j.cell.2015.05.047.

Lu, Hao, Mahantesh Halappanavar, and Ananth Kalyanaraman. 2015. “Parallel Heuristics for Scalable Community Detection.” Parallel Computing 47: 19–37. https://doi.org/https://doi.org/10.1016/j.parco.2015.03.003.

Malkov, Yury A., and D. A. Yashunin. 2016. “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” CoRR abs/1603.09320. http://arxiv.org/abs/1603.09320.

fastpg's People

Contributors

jefferys avatar tom-b avatar sararselitsky avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.