OncoLib is a library for reconstructing and simulating the evolution and NGS sequencing of a metastatic tumor.
OncoSim is written in C++11 and thus requires a modern C++ compiler (GCC >= 4.8.1, or Clang). In addition, OncoSim has the following dependencies.
Graphviz is required to visualize the resulting DOT files, but is not required for compilation.
In case doxygen is available, extended source code documentation will be generated.
To compile OncoSim, execute the following commands from the root of the repository:
$ mkdir build
$ cd build
$ cmake ..
$ make
In case CMake fails to detect any of the above libraries, run the following command with adjusted paths:
$ cmake .. -DCMAKE_PREFIX_PATH=/data/work/yunan/subclone/software/armadillo-8.500.1/build \
-DLIBLEMON_ROOT=/data/work/yunan/subclone/software/lemon-1.3.1/build \
-DBOOST_ROOT=/data/work/yunan/software/anaconda3/include/boost \
-DPYTHON_INCLUDE_DIR=/data/work/yunan/software/anaconda3/include/python3.6m \
-DPYTHON_LIBRARY=/data/work/yunan/software/anaconda3/lib
The compilation results in the following files in the build
directory:
EXECUTABLE | DESCRIPTION |
---|---|
simulate |
Simulate the migration and evolutionary history of a metastatic cancer. |
tree2dot |
Visualizes a given clone tree in DOT format. |
graph2dot |
Visualizes a given migration graph in DOT format. |
tree2freqs |
Extract mutation frequencies from a given clone tree. |
mix |
Generate mixtures of the leaves of a given clone tree. |
sequence |
Generate a read matrix (containing NGS read counts) from a given clone tree. |
reads2freqs |
Converts reads into mutation frequency confidence intervals. |
downsample |
Down sample a given read matrix. |
oncosim.so |
Python library. |
Below we describe the various formats used by OncoSim.
The first line of a clone tree file indicates the number t
of edges. The subsequent t
lines each encode an edge, by listing the labels of the incident vertices separated by a space or tab character. Next, the number s
of leaves is specified. The subsequent s
lines each encode a leaf. Each line is a string whose values are separated by spaces or tabs. The first value is the leaf label, followed by the location. Next, the number of samples is specified. For each sample the mixture proportion is specified. The remaining values correspond to mutations.
8 #edges
26;28 26;28_M1
25 26;28
25 25_M1
24 24_P
0;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23 25
0;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23 24
0;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23 22;23_P
GL 0;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23
4 #leaves
22;23_P P 2 1 0.850564 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24_P P 2 0 0.149436 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
26;28_M1 M1 2 0 0.46333 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 25 26 28
25_M1 M1 2 1 0.53667 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 25
A vertex labeling assigns a location to each vertex of a clone tree. Each line contains two values, the vertex label and the location label separated by a space or tab character. For example:
GL P
0 P
1 P
1_P P
2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23 P
22;23_P P
24 P
24_P P
25_M1 M1
26;28 M1
26;28_M1 M1
25 M1
A color map file assigns a color to each location. Each line contains two values, the location label and an integer indicating the color.
P 1
M_1 2
A frequency file encodes the frequency of every mutation in a sample. It is a tab separated file. The first line lists the number of anatomical sites followed by the number of samples and then the number of mutations, each on separate lines. The fourth line is ignored but describes the format of the rest of the file. Each subsequent line encodes the cell frequency of a mutation in a sample: first the sample 0-based index is given, followed by the label of the sample, the 0-based index of the anatomical site, the anatomical site label, the 0-based index of the mutation, the label of the mutation, the frequency lower bound and upper bound.
2 #anatomical sites
4 #samples
27 #characters
#sample_index sample_label anatomical_site_index anatomical_site_label character_index character_label f- f+
0 M1_0 0 M1 0 0 0.962918 1
0 M1_0 0 M1 1 2 0.892128 1
0 M1_0 0 M1 2 3 0.870492 1
0 M1_0 0 M1 3 4 0.895341 1
0 M1_0 0 M1 4 5 0.775452 1
...
A reads file encodes the number of variant and reference reads of every mutation in each sample. It is a tab separated file. The first line lists the number of anatomical sites followed by the number of samples and then the number of mutations, each on separate lines. The fourth line is ignored but describes the format of the rest of the file. Each subsequent line encodes the variant and reference read counts of a mutation in a sample: first the sample 0-based index is given, followed by the label of the sample, the 0-based index of the anatomical site, the anatomical site label, the 0-based index of the mutation, the label of the mutation, the number of reference reands and the number of variant reads.
2 #anatomical sites
4 #samples
27 #characters
#sample_index sample_label anatomical_site_index anatomical_site_label character_index character_label ref var
0 M1_0 0 M1 0 0 72 91
0 M1_0 0 M1 1 2 83 90
0 M1_0 0 M1 2 3 77 81
0 M1_0 0 M1 3 4 75 83
0 M1_0 0 M1 4 5 83 72
...
$ ../build/simulate -h
Usage:
../build/simulate [--help|-h|-help] [-C int] [-D num] [-E num] [-K num]
[-N int] [-P num] [-c str] [-f num] [-k int] [-kP int] [-m int]
[-mig num] [-mut num] [-o str] [-p int] [-s int] [-v]
Where:
--help|-h|-help
Print a short help message
-C int
Target coverage (default: 200)
-D num
Driver probability (default: 1e-7)
-E num
Per base sequencing error rate (default: 0)
-K num
Carrying capacity (default: 5e4)
-N int
Number of successful simulations (default: -1)
-P num
Purity (default: 1)
-c str
Color map file
-f num
Mutation frequency threshold (default: 0.05)
-k int
Number of samples per anatomical site (default: 2)
-kP int
Number of samples for the primary tumor (default: 2)
-m int
Maximum number of detectable anatomical sites (default: 8)
-mig num
Migration rate (default: 1e-6)
-mut num
Mutation rate (default: 0.1)
-o str
Output directory
-p int
Allowed migration patterns:
0 : mS (default)
1 : mS, S
2 : mS, S, M
3 : mS, S, M, R
-s int
Random number generator seed (default: 0)
-v
Verbose
The Python module oncosim.so
has the following eight functions:
- simulate
- tree2dot
- graph2dot
- tree2freqs
- mix
- sequence
- reads2freqs
- downsample
These functions correspond to the standalone executables described previously. Please refer to example/example.ipynb for example usage.