scPopCorn
A Python tool for comparative analysis of multiple single-cell RNA-seq datasets.
1. Installation
$ pip install scpopcorn
2. Input scRNA-seq Data File Format
scPopCorn takes multiple single-cell RNA-seq datasets as input. Each dataset is a gene-by-cell count matrix in the format shown below. Example data files can be found in the Data folder.
| | Cell1ID | Cell2ID | Cell3ID | Cell4ID | ... |
---|---|---|---|---|---|
Gene1 | 12 | 0 | 0 | 0 | ... |
Gene2 | 125 | 0 | 298 | 0 | ... |
Gene3 | 0 | 0 | 0 | 0 | ... |
... | ... | ... | ... | ... | ... |
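As a rough illustration of the layout above, the following sketch writes a toy genes-by-cells count matrix and reads it back. The tab delimiter, the empty leading header cell, and the helper names `write_matrix` and `read_matrix` are assumptions made for this example; check the files in the Data folder for the exact layout scPopCorn expects.

```python
import csv
import io

def write_matrix(handle, cell_ids, gene_rows):
    """Write a genes-by-cells matrix: one header row of cell IDs, then one row per gene."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([""] + list(cell_ids))  # assumed: empty cell above the gene column
    for gene, counts in gene_rows:
        writer.writerow([gene] + list(counts))

def read_matrix(handle):
    """Return (cell_ids, {gene: [counts]}) from the layout written above."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cell_ids = header[1:]
    genes = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts = [int(value) for value in row[1:]]
        if len(counts) != len(cell_ids):
            raise ValueError(f"{row[0]}: expected {len(cell_ids)} counts, got {len(counts)}")
        genes[row[0]] = counts
    return cell_ids, genes

# Toy data mirroring the table above (values are illustrative only).
cells = ["Cell1ID", "Cell2ID", "Cell3ID", "Cell4ID"]
rows = [("Gene1", [12, 0, 0, 0]), ("Gene2", [125, 0, 298, 0]), ("Gene3", [0, 0, 0, 0])]

buffer = io.StringIO()
write_matrix(buffer, cells, rows)
buffer.seek(0)
cell_ids, genes = read_matrix(buffer)
print(len(genes), "genes x", len(cell_ids), "cells")
```

A round trip like this is a cheap way to catch rows whose length does not match the header before handing a file to the reader.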
Ground-truth labels for the cells in each dataset can also be provided. The format is as follows:
Cell1ID | Label1 |
---|---|
Cell2ID | Label2 |
Cell3ID | Label3 |
Cell4ID | Label4 |
... | ... |
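Before loading labels, it can help to confirm that every cell in the expression matrix has one. The sketch below does that for a two-column label file. The tab delimiter, the helper name `read_labels`, and its column-index arguments are assumptions for illustration and are not part of scPopCorn.

```python
import csv
import io

def read_labels(handle, id_col=0, label_col=1):
    """Map cell ID -> label from a delimited two-column file (tab-delimited is assumed)."""
    labels = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) <= max(id_col, label_col):
            continue  # skip blank or short lines
        labels[row[id_col]] = row[label_col]
    return labels

# Toy label file and toy list of cell IDs taken from an expression matrix header.
label_file = io.StringIO("Cell1ID\tLabel1\nCell2ID\tLabel2\nCell3ID\tLabel1\n")
matrix_cells = ["Cell1ID", "Cell2ID", "Cell3ID", "Cell4ID"]

labels = read_labels(label_file)
missing = [cell for cell in matrix_cells if cell not in labels]
print("cells without a label:", missing)
```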
3. How to use
3.1 import scpopcorn package
from scpopcorn import MergeSingleCell
from scpopcorn import SingleCellData
3.2 read in RNA-seq datasets
File1 = "../Data/Human&Mouse_Pancreas/pancreas_human.expressionMatrix.txt"
Test1 = SingleCellData()
Test1.ReadData_SeuratFormat(File1)
File2 = "../Data/Human&Mouse_Pancreas/pancreas_mouse.expressionMatrix.txt"
Test2 = SingleCellData()
Test2.ReadData_SeuratFormat(File2)
3.3 read in ground truth cell labels (this is optional)
File1T = "../Data/Human&Mouse_Pancreas/pancreas_human.CellLabels.txt"
Test1.ReadTurth(File1T, 0, 1)
File2T = "../Data/Human&Mouse_Pancreas/pancreas_mouse.CellLabels.txt"
Test2.ReadTurth(File2T, 0, 1)
3.4 normalize count data, find highly variable genes, and take the natural logarithm of one plus the counts
Test1.Normalized_per_Cell()
Test1.FindHVG()
Test1.Log1P()
Test2.Normalized_per_Cell()
Test2.FindHVG()
Test2.Log1P()
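To make the two numeric transforms concrete, here is a minimal pure-Python sketch of per-cell normalization followed by ln(1 + x). It is not scPopCorn's implementation: the target sum of 10,000 and the function names are assumptions, and the highly-variable-gene selection step is left out.

```python
import math

def normalize_per_cell(counts_by_cell, target_sum=10_000.0):
    """Scale each cell's counts so they sum to target_sum (an assumed scale factor)."""
    normalized = []
    for counts in counts_by_cell:
        total = sum(counts)
        scale = target_sum / total if total > 0 else 0.0  # leave empty cells at zero
        normalized.append([value * scale for value in counts])
    return normalized

def log1p_transform(values_by_cell):
    """Apply the natural logarithm of one plus x to every value."""
    return [[math.log1p(value) for value in values] for values in values_by_cell]

# Two toy cells with the same composition but a tenfold difference in depth
# (rows are cells, columns are genes).
raw = [[12, 125, 0], [120, 1250, 0]]
normalized = normalize_per_cell(raw)
logged = log1p_transform(normalized)
print([round(value, 3) for value in logged[0]])
```

After normalization the two toy cells become identical, which is the point of the step: differences in sequencing depth no longer dominate the comparison.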
3.5 combine datasets and set number of supercells for each dataset
NumSuperCell_Test1 = 50
NumSuperCell_Test2 = 50
MSingle = MergeSingleCell(Test1, Test2)
MSingle.MultiDefineSuperCell(NumSuperCell_Test1,NumSuperCell_Test2)
In this example, we define 50 supercells for each dataset. To choose the number of supercells M for a dataset with N cells, pick M so that N/M is between 20 and 30, so that each supercell holds about 20 to 30 cells on average.
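The N/M rule can be turned into a small helper. This is a sketch only: the function names and the default target ratio of 25 are assumptions, not part of the scpopcorn package.

```python
import math

def supercell_range(num_cells, min_ratio=20, max_ratio=30):
    """Return (smallest M, largest M) such that min_ratio <= N / M <= max_ratio."""
    low = math.ceil(num_cells / max_ratio)    # N / M <= 30  ->  M >= N / 30
    high = math.floor(num_cells / min_ratio)  # N / M >= 20  ->  M <= N / 20
    if low > high or high < 1:
        raise ValueError("no whole number M satisfies the ratio for this N")
    return low, high

def suggest_supercells(num_cells, target_ratio=25):
    """Pick M near N / target_ratio, clamped to the admissible range."""
    low, high = supercell_range(num_cells)
    return min(max(round(num_cells / target_ratio), low), high)

# Hypothetical dataset size, chosen only to illustrate the arithmetic.
N = 1250
print(supercell_range(N), suggest_supercells(N))
```

The result could then be passed as the corresponding argument of `MSingle.MultiDefineSuperCell`, one value per dataset.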
3.6 compute co-membership graph within each dataset and similarity matrix across datasets
MSingle.ConstructWithinSimiarlityMat_SuperCellLevel()
MSingle.ConstructBetweenSimiarlityMat_SuperCellLevel()
3.7 run joint partition
Estimate_NumCluster = 10 # initial guess of the number of corresponding clusters; it does not need to be accurate
MSingle.SDP_NKcut(Estimate_NumCluster)
Estimate_NumCluster is an initial guess of the number of sub-populations you want to find; it only needs to be an approximation.
3.8 rounding the results
NumCluster_Min = 3
NumCluster_Max = 20
CResult = MSingle.NKcut_Rounding(NumCluster_Min, NumCluster_Max)
scPopCorn screens the number of clusters from NumCluster_Min to NumCluster_Max and automatically selects the best number of clusters in [NumCluster_Min, NumCluster_Max].
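The screening step can be pictured as trying every candidate cluster count and keeping the one with the best score. The sketch below shows only that outer loop. The `score` callback and `toy_score` are hypothetical placeholders, since this README does not state which criterion scPopCorn uses internally.

```python
def best_cluster_count(min_k, max_k, score):
    """Try every k in [min_k, max_k] and return (best k, all scores); higher is better."""
    if min_k > max_k:
        raise ValueError("min_k must not exceed max_k")
    scores = {k: score(k) for k in range(min_k, max_k + 1)}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Hypothetical score that peaks at k = 7, standing in for a real quality measure.
def toy_score(k):
    return -abs(k - 7)

best_k, scores = best_cluster_count(3, 20, toy_score)
print(best_k)
```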
3.9 evaluate the clustering results using ground truth (this is optional)
MSingle.Evaluation(CResult)
3.10 similarity between cell subpopulations across datasets
MSingle.StatResult()
3.11 UMAP plots using the results generated by scPopCorn
MSingle.Umap_Result()
3.12 scPopCorn for sub-clusters
After viewing the UMAP plot, you may want to further jointly partition a sub-cluster. You can do so as follows:
ClusterID = 0
NumCluster = 3
MSingle.Deep_Partition(ClusterID, NumCluster) # deep partition for cluster 0 into 3 clusters
NumCluster_Min = 3
NumCluster_Max = 5
MSingle.SDP_Deep_Rounding(NumCluster_Min, NumCluster_Max) # find out best number of clusters for the deep partition
MSingle.Merge_Deep_Partition() # merge the new partitions to the original one
MSingle.Umap_Result() # see the new results
3.13 output the results
MSingle.OutputResult("TestOut.txt")
The results are written to the "TestOut.txt" file.
4. Examples and reproducible results
Jupyter notebooks with examples are provided in the Reproduce folder.