
ctDNAtools's People

Contributors

alkodsi


ctDNAtools's Issues

bfs counts always 0

Hi!
All the counts from bfs are NaN when normalized and 0 when not normalized. Below is the code I ran:

```r
> bfs <- c("./<path_to_bam>/1.bam") %>%
+   map(bin_fragment_size, bin_size = 5, normalized = TRUE) %>%
+   purrr::reduce(inner_join, by = "Breaks")
> tail(bfs, n = 12)
    Breaks UNAFFECTED
69 341_346        NaN
70 346_351        NaN
71 351_356        NaN
72 356_361        NaN
73 361_366        NaN
74 366_371        NaN
75 371_376        NaN
76 376_381        NaN
77 381_386        NaN
78 386_391        NaN
79 391_396        NaN
80 396_400        NaN
```

However, I know there are fragments, as shown below:

```r
> fs1 <- get_fragment_size(bam = "/<path_to_bam>/1.bam",
+                          mapqFilter = 30,
+                          isProperPair = NA,
+                          min_size = 1,
+                          max_size = 400,
+                          ignore_trimmed = FALSE,
+                          simple_cigar = FALSE,
+                          different_strands = TRUE)
> head(fs1)
      Sample                                                                 ID  chr    start      end size
1 UNAFFECTED   UNAFFECTED_A00450:173:HMNH2DSX2:3:2441:6560:29543:GAAGTG+AAGTCCA chr7 55242170 55242488  319
2 UNAFFECTED UNAFFECTED_A00450:173:HMNH2DSX2:4:1264:16206:35759:TAGCGTA+AGTGGTA chr7 55242186 55242365  180
3 UNAFFECTED  UNAFFECTED_A00450:173:HMNH2DSX2:3:2240:13901:14763:GTAACA+TAATGCG chr7 55242190 55242539  350
4 UNAFFECTED  UNAFFECTED_A00450:173:HMNH2DSX2:3:1204:4372:13823:GAACCTC+GCTGTCA chr7 55242197 55242511  315
5 UNAFFECTED  UNAFFECTED_A00450:173:HMNH2DSX2:3:1153:14389:34710:CGGTGTA+GTAACA chr7 55242203 55242561  359
6 UNAFFECTED   UNAFFECTED_A00450:173:HMNH2DSX2:3:2523:4453:13777:CCTCATA+AATGAG chr7 55242203 55242566  364
```

Is there something I am doing wrong?
Thanks!
Rini
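A side note on the NaN values, offered only as a guess (not confirmed from the package source): if the normalization divides each bin count by a total that is itself zero, R's arithmetic produces NaN rather than 0, which would match the output above.

```r
# Illustration of base-R division semantics only, NOT ctDNAtools internals:
# if every bin count is zero, dividing by the zero total gives NaN, not 0.
counts <- c(0, 0, 0)
normalized <- counts / sum(counts)  # each element is 0 / 0
normalized
#> [1] NaN NaN NaN
```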

un-matched healthy samples and tumor fraction calculation

Dear Amjad Alkodsi, I am a bioinformatician, and I was wondering whether it is possible to use unmatched healthy samples with respect to the tumor samples that we sequenced. We cannot sequence normal tissue, so we were thinking of using some neonatal BAM files obtained online to compare with our samples.

We are actually interested in calculating the tumor fraction of these samples, so any suggestions on that would also be welcome.

Thank you very much for your time.

test_ctDNA not working with hg19 NCBI BAM file

Hi,

Thanks for a great tool! I am trying to use test_ctDNA on BAM files from targeted panel sequencing data, aligned to the NCBI GRCh37 (RefSeq) reference sequence. Even after changing the seqname style of BSgenome.Hsapiens.UCSC.hg19 from UCSC to NCBI, I kept getting the error: "Error: Chromosomes in bam file don't match the specified reference".
When I ran the get_bam_chr() function, I saw that the culprit chromosomes were the unlocalized GL00* sequences. These sequences failed to map in BSgenome when I switched from UCSC to NCBI.
Is there a way to skip the chromosomes in the BAM file that do not match the reference? Could that be added as a parameter to the test_ctDNA function?

Below is the code and the sessionInfo():

```r
# To make the BSgenome.Hsapiens.UCSC.hg19 package compatible, I first changed
# the seqname style as follows:
library(BSgenome.Hsapiens.UCSC.hg19)

# change ref genome seq style from UCSC ("chr" prefix) to NCBI to match the BAM files
hg19 <- BSgenome.Hsapiens.UCSC.hg19
seqlevelsStyle(hg19) <- "NCBI"

# test ctDNA
test_pos_mrd <- test_ctDNA(
  bam = "panel_data/bam_files/P05.preop.subset.bam",
  mutations = mutations_df[mutations_df$patient_id == "P05",
                           c("CHROM", "POS", "REF", "ALT")],
  targets = target,
  reference = hg19,
  informative_reads_threshold = 100
)
#> Error: Chromosomes in bam file don't match the specified reference
```

```r
# check the chromosomes in the BAM file
> get_bam_chr("panel_data/bam_files/P05.preop.subset.bam")
 [1] "1"          "2"          "3"          "4"          "5"         
 [6] "6"          "7"          "8"          "9"          "10"        
[11] "11"         "12"         "13"         "14"         "15"        
[16] "16"         "17"         "18"         "19"         "20"        
[21] "21"         "22"         "X"          "Y"          "MT"        
[26] "GL000207.1" "GL000226.1" "GL000229.1" "GL000231.1" "GL000210.1"
[31] "GL000239.1" "GL000235.1" "GL000201.1" "GL000247.1" "GL000245.1"
[36] "GL000197.1" "GL000203.1" "GL000246.1" "GL000249.1" "GL000196.1"
[41] "GL000248.1" "GL000244.1" "GL000238.1" "GL000202.1" "GL000234.1"
[46] "GL000232.1" "GL000206.1" "GL000240.1" "GL000236.1" "GL000241.1"
[51] "GL000243.1" "GL000242.1" "GL000230.1" "GL000237.1" "GL000233.1"
[56] "GL000204.1" "GL000198.1" "GL000208.1" "GL000191.1" "GL000227.1"
[61] "GL000228.1" "GL000214.1" "GL000221.1" "GL000209.1" "GL000218.1"
[66] "GL000220.1" "GL000213.1" "GL000211.1" "GL000199.1" "GL000217.1"
[71] "GL000216.1" "GL000215.1" "GL000205.1" "GL000219.1" "GL000224.1"
[76] "GL000223.1" "GL000195.1" "GL000212.1" "GL000222.1" "GL000200.1"
[81] "GL000193.1" "GL000194.1" "GL000225.1" "GL000192.1"
```

sessionInfo():

R version 4.3.2 (2023-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS/LAPACK: /Users/zahra/.conda/envs/R/lib/libopenblasp-r0.3.23.dylib;  LAPACK version 3.11.0

locale:
[1] C/UTF-8/C/C/C/C

time zone: Europe/Stockholm
tzcode source: system (macOS)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] data.table_1.14.10                GenomicAlignments_1.38.0         
 [3] SummarizedExperiment_1.32.0       Biobase_2.60.0                   
 [5] MatrixGenerics_1.14.0             matrixStats_1.2.0                
 [7] Rsamtools_2.16.0                  BSgenome.Hsapiens.UCSC.hg19_1.4.3
 [9] BSgenome_1.70.1                   rtracklayer_1.62.0               
[11] BiocIO_1.12.0                     Biostrings_2.68.1                
[13] XVector_0.40.0                    GenomicRanges_1.52.1             
[15] GenomeInfoDb_1.36.4               IRanges_2.34.1                   
[17] S4Vectors_0.38.2                  BiocGenerics_0.46.0              
[19] plyr_1.8.9                        ggplot2_3.4.4                    
[21] dplyr_1.1.4                       tidyr_1.3.0                      
[23] purrr_1.0.2                       ctDNAtools_0.4.1                 

loaded via a namespace (and not attached):
 [1] utf8_1.2.4              generics_0.1.3          SparseArray_1.2.2      
 [4] bitops_1.0-7            lattice_0.22-5          magrittr_2.0.3         
 [7] grid_4.3.2              Matrix_1.6-4            restfulr_0.0.15        
[10] fansi_1.0.6             scales_1.3.0            XML_3.99-0.16          
[13] codetools_0.2-19        abind_1.4-5             cli_3.6.2              
[16] rlang_1.1.2             crayon_1.5.2            munsell_0.5.0          
[19] DelayedArray_0.28.0     withr_2.5.2             yaml_2.3.8             
[22] S4Arrays_1.2.0          tools_4.3.2             parallel_4.3.2         
[25] BiocParallel_1.34.2     colorspace_2.1-0        GenomeInfoDbData_1.2.10
[28] assertthat_0.2.1        vctrs_0.6.5             R6_2.5.1               
[31] lifecycle_1.0.4         zlibbioc_1.46.0         pkgconfig_2.0.3        
[34] pillar_1.9.0            gtable_0.3.4            glue_1.6.2             
[37] Rcpp_1.0.11             tibble_3.2.1            tidyselect_1.2.0       
[40] rjson_0.2.21            compiler_4.3.2          RCurl_1.98-1.13        
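Until such a skip parameter exists, one way to see exactly which contigs cause the mismatch is a plain setdiff() between the BAM chromosome names and the reference seqnames. This is only a sketch: bam_chr below is an abbreviated stand-in typed in from the get_bam_chr() output above, and ref_chr stands in for the seqnames of hg19 after the style switch.

```r
# Sketch: find BAM contigs missing from the reference after the style switch.
# bam_chr abbreviates the get_bam_chr() output shown above; ref_chr stands in
# for seqnames(hg19) after seqlevelsStyle(hg19) <- "NCBI".
bam_chr <- c(as.character(1:22), "X", "Y", "MT", "GL000207.1", "GL000226.1")
ref_chr <- c(as.character(1:22), "X", "Y", "MT")
setdiff(bam_chr, ref_chr)
#> [1] "GL000207.1" "GL000226.1"
```

A samtools-based subset of the BAM to only the shared names before running test_ctDNA would then avoid the error without any package change.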


Use custom bin sizes to summarize fragment sizes

Hello,
Many thanks for making this package available; it's very useful and helpful.
I would like to take advantage of the summarize_fragment_size method to characterise the proportion of reads in custom fragment size bins. I'm struggling to see how to define these bin sizes, so any pointers would be appreciated.
Thanks!
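Not an answer from the package author, but a base-R sketch of one workaround: compute fragment sizes with get_fragment_size() and bin them yourself with cut(), which accepts arbitrary break points. The fs data frame below is a made-up stand-in for the `size` column of get_fragment_size() output.

```r
# Made-up stand-in for the `size` column of get_fragment_size() output.
fs <- data.frame(size = c(120, 167, 168, 240, 333, 390))

# Custom, unequal bin edges instead of a fixed bin_size grid.
breaks <- c(0, 150, 220, 400)
fs$bin <- cut(fs$size, breaks = breaks, include.lowest = TRUE)

# Proportion of fragments per custom bin.
prop.table(table(fs$bin))
```

The same table() call with normalized counts dropped (i.e. without prop.table) gives raw per-bin counts.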

Detailed results

Hello, and thank you for your work; I find it very interesting! I would like to know if it's possible to provide more details in the output of the test_ctDNA() function, specifically the details of the mutations detected as positive. Thank you! Alexis

NGS panel used

Thanks for the great tool.

I saw in one of your publications that the panels you used are capture-based. I was wondering if ctDNAtools methods could be used with amplicon-based NGS panels.

Mutation names

Hello,
I've been working with mutations in get_fragment_size and have found that mapping the output back to the original mutations in the input file makes interpretation difficult. I have been looking at get_mutations_read_names and can see that there is an ID for each mutation (chr14:106327474_C_G in the example below) which, if included in the output, could be used to relate results back to the input mutations.
An alternative option would be to add an extra field to the mutations input, e.g. "MUT.NAME", which, if it were something explanatory like "KRAS G12C" and were also included in the resulting output, would be more informative than the read IDs for downstream analyses.

I have been trying to modify get_fragment_size to allow a mutation ID to be carried through and included in the output, but haven't been able to make any progress. Do you think this is feasible, and would you have any suggestions, please?

```r
get_mutations_read_names(bam = bamT1, mutations = mutations[1:3, ])
#> $chr14:106327474_C_G
#> $chr14:106327474_C_G$ref
#> [1] "T1_Library1:1954851" "T1_Library1:1955625" "T1_Library1:1956206"
...
```

Thanks
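One stop-gap that needs no package change, sketched here by following the chr:pos_ref_alt ID format visible in the get_mutations_read_names() output above: rebuild that ID from the mutations input yourself, so any per-mutation output can be joined back to the input by ID.

```r
# Rebuild the chr:pos_ref_alt identifier from the mutations input, matching
# the ID format shown in the get_mutations_read_names() output above.
mutations <- data.frame(CHROM = "chr14", POS = 106327474,
                        REF = "C", ALT = "G")
mutations$ID <- paste0(mutations$CHROM, ":", mutations$POS, "_",
                       mutations$REF, "_", mutations$ALT)
mutations$ID
#> [1] "chr14:106327474_C_G"
```

A human-readable column like the proposed "MUT.NAME" could be merged onto any result table by this ID with merge() or dplyr::left_join().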

ctDNAtools get_fragment_size() excessive memory usage

Hi,
I am trying to use ctDNAtools for my analysis, and my get_fragment_size() processes are being killed because ctDNAtools exceeds the available memory. I am working on a server with 256 GB of memory and with BAM files of 35-50 GB in size. Is it normal that ctDNAtools cannot handle this on such a machine?
Is there any way to limit the memory usage and prevent the processes from crashing?
Thanks a lot for developing and maintaining such a nice tool.

Producing fragment size histograms for whole genome data

Hello,

Firstly, thanks for this useful resource.

I've been trying to follow some of the examples in the 'Get Started' guide to produce fragment size histograms for my data.

I am currently running it as follows:

```r
bfs_SBJ00037 <- bin_fragment_size(bam = "path/to/sample.bam",
                                  min_size = 1,
                                  max_size = 400,
                                  normalized = TRUE,
                                  bin_size = 10)
```

The BAM is 60 GB in size, and I haven't been able to get any results from the call above. I'm running on a cluster, and the job has been running for more than 12 hours now.

I wanted to ask:

  • Am I using the program correctly, or do I need to first run get_fragment_size on my BAM to produce the histograms?
  • What is the expected runtime of the program on whole-genome data? I'm currently allocating 64 GB of memory.
  • Also, are there any recommendations for choosing an appropriate bin size?
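On the first bullet, as a sketch only (not advice from the package author): a fragment size histogram can also be computed directly from get_fragment_size() output with base R's hist(), sidestepping bin_fragment_size() entirely. The fs data frame below is simulated stand-in data for the `size` column that get_fragment_size() returns.

```r
set.seed(1)
# Simulated stand-in for the `size` column of get_fragment_size() output.
fs <- data.frame(size = sample(50:400, 1000, replace = TRUE))

# Explicit breaks every 10 bp play the role of bin_size = 10;
# plot = FALSE returns the per-bin counts without drawing anything.
h <- hist(fs$size, breaks = seq(40, 400, by = 10), plot = FALSE)
sum(h$counts)  # every fragment falls in exactly one bin
#> [1] 1000
```

Dividing h$counts by sum(h$counts) gives the normalized proportions; plotting h (or calling hist() with plot = TRUE) draws the histogram.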
