Clustering benchmark data: 32-dimensional data set from Levine et al. (2015)

This repository contains R code to prepare the benchmark data set Levine_32dim, which can be used to test clustering algorithms.

The data set is a 32-dimensional mass cytometry (CyTOF) data set, consisting of protein expression levels for n = 265,627 cells, p = 32 protein markers (dimensions), and k = 14 manually gated cell populations (clusters), from h = 2 individuals. Cluster labels are available for 39% (104,184) of the cells. For more details see below.

The repository benchmark-data-Levine-13-dim contains R code to prepare a second benchmark data set with lower dimensionality (13 dimensions).

The data set is sourced from the following paper:

Raw data can be accessed through Cytobank:

If you use these data sets, please reference the paper by Levine et al. (2015).

Background

Mass cytometry

Mass cytometry (also known as CyTOF) is a new technology for high-throughput single-cell analysis, similar to flow cytometry but measuring a greater number of parameters per cell.

As in flow cytometry, data sets consist of expression levels of a set of protein markers for each cell. Currently, mass cytometry systems can measure hundreds of cells per second, and around 40 protein expression levels per cell. Typical data sets contain hundreds of thousands of cells per sample.

Protein expression levels can be used to characterize cell types, known as populations, and functional states. Applications of mass cytometry often involve analysis of cell populations: for example, detecting certain cell populations such as known disease biomarkers, detecting cell populations in specific functional states, or comparing proportions of populations between samples.

Flow cytometry data has traditionally been analyzed by "gating", which refers to visually searching for clusters or regions of high density in a series of two-dimensional scatter plots. While this works well for low-dimensional flow cytometry data, it quickly becomes unreliable and unwieldy in higher-dimensional data sets. To address this, several research groups have recently developed algorithms for automated detection of cell populations.

Levine et al. (2015) paper

Levine et al. (2015) (reference and link above) introduced PhenoGraph, a new graph-based algorithm for detecting clusters in high-dimensional mass cytometry data, and used it to study phenotypic and functional heterogeneity of cells from patients with acute myeloid leukemia (AML).

In their paper, Levine et al. used two benchmark data sets of healthy human bone marrow cells to demonstrate the performance of PhenoGraph. Healthy human bone marrow cells contain many well-characterized immune cell populations, and the evaluations showed that PhenoGraph was able to correctly identify these.

The data sets are used in Figures 2A-B, S2A-C, and Data S1A-F (13-dimensional data set), and Figure S2D and Data S1G-I (32-dimensional data set) in the paper.

The authors have made these data sets publicly available, and we have found them very useful for testing high-dimensional clustering algorithms.

32-dimensional benchmark data set

This is a 32-dimensional mass cytometry (CyTOF) data set, consisting of protein expression levels of bone marrow mononuclear cells (BMMCs) from two healthy individuals. (This data set is referred to as "benchmark data set 2" in Levine et al. 2015.)

The data set contains n = 265,627 cells, with a dimensionality of p = 32 surface marker proteins. Manually gated cell population (cluster) labels for k = 14 major immune cell populations are available for 39% (104,184) of the cells, with the remaining 61% (161,443) labeled as "unassigned". The cells are from two individuals labeled H1 and H2. For individual H1, 72,463 cells were assigned to populations, and 118,888 cells were unassigned (total 191,351 cells). For individual H2, 31,721 cells were assigned to populations, and 42,555 cells were unassigned (total 74,276 cells).
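The counts quoted above can be cross-checked with a few lines of R. This is only an arithmetic sanity check on the figures in this section; the variable names are illustrative and do not come from the repository's script.

```r
# Cell counts quoted above, per individual (variable names are illustrative)
counts <- data.frame(
  individual = c("H1", "H2"),
  assigned   = c(72463, 31721),
  unassigned = c(118888, 42555)
)
counts$total <- counts$assigned + counts$unassigned

n_assigned   <- sum(counts$assigned)     # manually gated cells, both individuals
n_unassigned <- sum(counts$unassigned)   # cells labeled "unassigned"
n_total      <- sum(counts$total)        # all cells in the data set

print(counts)
cat("assigned:", n_assigned, " unassigned:", n_unassigned, " total:", n_total, "\n")
cat("proportion labeled:", round(n_assigned / n_total, 2), "\n")
```

The per-individual totals should add up to n, and the proportion of labeled cells should round to the 39% stated above.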

19 of the 32 surface markers were used for manual gating. These 19 are: CD3, CD4, CD7, CD8, CD15, CD16, CD19, CD20, CD34, CD38, CD41, CD44, CD45, CD61, CD64, CD123, CD11c, CD235a/b, and HLA-DR. All 32 surface markers were used for automated detection of cell populations with PhenoGraph by Levine et al. (2015). See Levine et al. (2015), Supplemental Experimental Procedures, for more details.

13-dimensional benchmark data set

This is a 13-dimensional mass cytometry (CyTOF) data set, consisting of protein expression levels of BMMCs from one healthy individual. (This data set is referred to as "benchmark data set 1" in Levine et al. 2015.)

The data set contains n = 167,044 cells, with a dimensionality of p = 13 surface marker proteins. Manually gated cell population (cluster) labels are provided for k = 24 major immune cell populations (i.e. higher resolution than the other data set). Cluster labels are available for 49% (81,747) of the cells, with the remaining 51% (85,297) labeled as "unassigned". All cells are from a single individual.

The 13 surface markers are: CD45, CD45RA, CD19, CD11b, CD4, CD8, CD34, CD20, CD33, CD123, CD38, CD90, and CD3. All 13 surface markers were used for manual gating. An additional "DNA * cell length" gating step was also applied to remove platelets. See Levine et al. (2015), Supplemental Experimental Procedures, for more details.

This repository

Purpose

We have written R code to pre-process and export these benchmark data sets in standard formats, in order to make it easier for researchers from other fields to access them to test clustering algorithms. This repository contains R code for the 32-dimensional benchmark data set, and the companion repository benchmark-data-Levine-13-dim contains code for the 13-dimensional benchmark data set.

The publicly available data files provided by Levine et al. (2015) through Cytobank are in FCS (Flow Cytometry Standard) format, with one FCS file per manually gated cell population (cluster). The FCS format is an efficient binary file format, and is the most widely used format in the flow cytometry community. However, it requires specialized software tools to access, making it relatively inaccessible for researchers from other areas.

In addition, an arcsinh transform is usually applied before performing gating or automated analysis. At high values, the arcsinh is similar to a log transform, while for small values it is close to linear; the near-linear behavior is needed because CyTOF data contain many small and negative values, for which a log transform is undefined or unstable. Standard scale factors for the arcsinh transform are 5 for CyTOF data, and 150 for flow cytometry data (see Bendall et al. 2011, Supplementary Figure S2). For flow cytometry data, other similar transforms such as the logicle or biexponential are also frequently used.
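As a minimal sketch of the transform described above, the following uses base R's `asinh()` with the value divided by the scale factor. The function name `arcsinh_transform` and the example values are illustrative assumptions, not taken from the repository's script.

```r
# arcsinh transform with a scale factor; 5 is the value quoted above for CyTOF data
arcsinh_transform <- function(x, cofactor = 5) {
  asinh(x / cofactor)
}

# Illustrative raw values, including a negative value and zero
x <- c(-2, 0, 1, 10, 1000, 10000)
y <- arcsinh_transform(x)

# Compare against the two limiting behaviors:
#   small values: asinh(x / 5) is close to x / 5 (linear)
#   large values: asinh(x / 5) is close to log(2 * x / 5)
comparison <- data.frame(
  raw        = x,
  arcsinh    = y,
  linear     = x / 5,
  log_approx = suppressWarnings(log(2 * x / 5))  # NaN or -Inf where log is undefined
)
print(round(comparison, 3))
```

For flow cytometry data, the same function could be called with `cofactor = 150`. The transform is invertible with `sinh(y) * cofactor`.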

Steps

The R script in this repository performs the following steps:

  • Load the FCS files
  • Extract cell population names, protein marker names, and labels for each individual
  • Extract cluster labels (one cluster per FCS file; "unassigned" cells are labeled "NA")
  • Apply arcsinh transform (scale factor 5 for CyTOF data; see Bendall et al. 2011, Supplementary Figure S2)
  • Export data in FCS and tab-delimited TXT format (separate files with/without arcsinh transform)
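The steps above can be sketched roughly as follows. This is not the repository's script: to stay self-contained it replaces the FCS files with small simulated matrices, and the population names, marker subset, integer label coding, and output file name are all assumptions made for illustration. With real data, each matrix would instead come from a call such as `flowCore::exprs(flowCore::read.FCS(file, transformation = FALSE))`.

```r
# Stand-ins for the per-population expression matrices (one per FCS file).
set.seed(1)
markers <- c("CD3", "CD4", "CD19")        # illustrative subset of markers
make_cells <- function(n) {
  matrix(rexp(n * length(markers), rate = 0.1), nrow = n,
         dimnames = list(NULL, markers))
}
populations <- list(                      # hypothetical population names
  "B_cells"    = make_cells(4),
  "T_cells"    = make_cells(5),
  "unassigned" = make_cells(3)
)

# One integer cluster label per population; "unassigned" cells get NA
pop_names <- setdiff(names(populations), "unassigned")
labels <- unlist(lapply(names(populations), function(nm) {
  rep(match(nm, pop_names), nrow(populations[[nm]]))
}))

# Combine all cells into one matrix and apply the arcsinh transform (scale factor 5)
data_raw    <- do.call(rbind, unname(populations))
data_transf <- asinh(data_raw / 5)

# Export as tab-delimited text, with the cluster label as an extra column
out <- cbind(data_transf, label = labels)
out_file <- file.path(tempdir(), "example_transformed.txt")
write.table(out, file = out_file, quote = FALSE, sep = "\t", row.names = FALSE)
```

A label for each individual (H1 or H2) could be carried along in the same way as the cluster label. Exporting back to FCS format would additionally need flowCore and is left out of this sketch.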

Contents

The files in this repository are:

We have not included the exported TXT format files in this repository, since they are too large for a GitHub repository (>100 MB). If you need the data files in TXT format, either download the raw data files from Cytobank and run the R script from this repository, or use the R code below to directly load and convert the exported FCS files (change the filename as required):

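# install flowCore package from Bioconductor
# (biocLite() has been replaced by BiocManager in current Bioconductor releases)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("flowCore")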

library(flowCore)

# load FCS file and save in TXT format
data <- flowCore::exprs(flowCore::read.FCS("Levine_32dim.fcs", transformation = FALSE))

head(data)
dim(data)

write.table(data, file = "Levine_32dim.txt", quote = FALSE, sep = "\t", row.names = FALSE)
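Once a TXT file has been written, it can be read back into R for clustering experiments. The sketch below is self-contained: it first writes a tiny stand-in file to a temporary directory, and it assumes that the cluster labels are stored in a column named `label`, with `NA` for unassigned cells. Both the stand-in values and the column name are assumptions; check `colnames()` on the real file before relying on them.

```r
# Tiny stand-in for an exported TXT file (values and column names are made up)
example <- data.frame(
  CD3   = c(0.1, 2.3, 1.8, 0.0),
  CD19  = c(3.2, 0.2, 0.1, 0.4),
  label = c(1, 2, 2, NA)
)
txt_file <- file.path(tempdir(), "example_cells.txt")
write.table(example, file = txt_file, quote = FALSE, sep = "\t", row.names = FALSE)

# Read the tab-delimited file back in
cells <- read.delim(txt_file, check.names = FALSE)

# Separate the marker expression matrix from the cluster labels
label_col <- "label"                      # assumed column name
expr   <- as.matrix(cells[, setdiff(colnames(cells), label_col), drop = FALSE])
labels <- cells[[label_col]]

# Keep only manually gated cells when scoring a clustering against the labels
is_labeled <- !is.na(labels)
print(table(labels, useNA = "ifany"))
print(dim(expr[is_labeled, , drop = FALSE]))
```

A clustering algorithm would typically be run on `expr` (all cells, or only the labeled ones), with `labels[is_labeled]` used as the reference when evaluating the result.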

References and links

The benchmark data sets are sourced from the paper by Levine et al. (2015):

Data from Levine et al. (2015) are publicly available through Cytobank at the following links. Note that a (free) Cytobank account is required.

Additional information can also be found on the Dana Pe'er lab web page, at: http://www.c2b2.columbia.edu/danapeerlab/html/phenograph.html

The 13-dimensional benchmark data set was originally published by Bendall et al. (2011):
