brendelgroup / tsrchitect
Promoter identification from diverse types of large-scale TSS profiling data
License: GNU General Public License v3.0
mergeSampleData seems to take a long time to run, and when merging TSSs it does not appear to sum the tag counts of TSSs that overlap (i.e. share the same position).
Using the dplyr library would make TSS merging much faster and would also give the expected behavior:
library(dplyr)

# Stack both samples, then sum the tag counts of TSSs at the same position
merged <- rbind(sample.1, sample.2) %>%
  group_by(seq, TSS, strand) %>%
  summarize(nTAGs = sum(nTAGs)) %>%
  arrange(seq, TSS) %>%
  as.data.frame()
Here sample.1 and sample.2 are two hypothetical data frames stored in the @tssCountData slot. This outputs a position-sorted data frame in which the tag counts of any overlapping TSSs are summed. It should also take only a few seconds to run.
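For anyone who wants to check the intended behavior without dplyr, here is a minimal base-R sketch on toy data. The function name mergeTssCounts and the toy values are made up for illustration; the column names (seq, TSS, nTAGs, strand) are taken from the snippet above.

```r
# Hypothetical helper (not part of TSRchitect): merge two TSS count tables
# and sum the tag counts of TSSs at the same seq/position/strand.
mergeTssCounts <- function(sample.1, sample.2) {
  combined <- rbind(sample.1, sample.2)
  merged <- aggregate(nTAGs ~ seq + TSS + strand, data = combined, FUN = sum)
  merged <- merged[order(merged$seq, merged$TSS), , drop = FALSE]
  rownames(merged) <- NULL
  merged
}

# Toy data: the TSS at chr1:100 (+) is present in both samples
sample.1 <- data.frame(seq = c("chr1", "chr1", "chr2"),
                       TSS = c(100L, 250L, 40L),
                       nTAGs = c(5, 2, 7),
                       strand = c("+", "+", "-"),
                       stringsAsFactors = FALSE)
sample.2 <- data.frame(seq = c("chr1", "chr2"),
                       TSS = c(100L, 90L),
                       nTAGs = c(3, 1),
                       strand = c("+", "-"),
                       stringsAsFactors = FALSE)

merged <- mergeTssCounts(sample.1, sample.2)
print(merged)
```

The shared TSS ends up as a single row with the two counts added together, and the other positions pass through unchanged.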
Would be handy to have an explicit argument, such as outputDir, to specify where you want the text files to be written if writeTable = TRUE in either processTSS or determineTSR. It's often a little more convenient than having to change the working directory for each of the commands.
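A rough sketch of what the requested behavior could look like, assuming the table-writing step can be routed through a small helper. writeTableTo and its arguments are hypothetical and do not reflect how processTSS or determineTSR are actually implemented.

```r
# Hypothetical helper: write a table into outputDir instead of the
# current working directory. Defaults to the working directory, so
# existing calls would behave as before.
writeTableTo <- function(df, fileName, outputDir = getwd()) {
  if (!dir.exists(outputDir)) {
    dir.create(outputDir, recursive = TRUE)
  }
  outFile <- file.path(outputDir, fileName)
  write.table(df, file = outFile, sep = "\t", quote = FALSE,
              row.names = FALSE)
  outFile
}

# Usage with a toy table and a temporary directory
toy <- data.frame(seq = "chr1", TSS = 100L, nTAGs = 8, strand = "+")
outPath <- writeTableTo(toy, "TSSset-1.txt",
                        outputDir = file.path(tempdir(), "tsr_out"))
```

Defaulting outputDir to getwd() would keep the current behavior for anyone who does not set the argument.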
After filtering, TSRchitect provides you with a list of TSSs and the number of tags at each position. It would be great to have a convenience function that can turn this information into a bedGraph file for visualization in a genome viewer. It should perhaps output separate bedGraph files for the + and - strands. I've attached example files below from the latest STRIPE-seq run with yeast.
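Something along these lines might work as a starting point. This is only a sketch: tssToBedgraph is a hypothetical name, the input is assumed to be a data frame with seq, TSS, nTAGs and strand columns, and TSS positions are assumed to be 1-based (bedGraph intervals are 0-based, half-open).

```r
# Hypothetical helper (not part of TSRchitect): write one bedGraph file
# per strand from a TSS count table.
tssToBedgraph <- function(tss, prefix, outputDir = ".") {
  stopifnot(all(c("seq", "TSS", "nTAGs", "strand") %in% names(tss)))
  outFiles <- character(0)
  for (s in c("+", "-")) {
    rows <- tss[tss$strand == s, , drop = FALSE]
    rows <- rows[order(rows$seq, rows$TSS), , drop = FALSE]
    bg <- data.frame(
      chrom = rows$seq,
      # Integer coordinates avoid scientific notation in the output
      start = as.integer(rows$TSS) - 1L,  # assumes 1-based TSS positions
      end   = as.integer(rows$TSS),
      value = format(rows$nTAGs, scientific = FALSE, trim = TRUE),
      stringsAsFactors = FALSE
    )
    label <- if (s == "+") "plus" else "minus"
    outFile <- file.path(outputDir, paste0(prefix, "_", label, ".bedgraph"))
    write.table(bg, file = outFile, sep = "\t", quote = FALSE,
                row.names = FALSE, col.names = FALSE)
    outFiles[label] <- outFile
  }
  outFiles
}

# Usage with toy data, written to a temporary directory
tssToy <- data.frame(seq = c("chr1", "chr2"),
                     TSS = c(100L, 40L),
                     nTAGs = c(8, 7),
                     strand = c("+", "-"),
                     stringsAsFactors = FALSE)
bgFiles <- tssToBedgraph(tssToy, prefix = "demo", outputDir = tempdir())
```

Some people prefer negative values on the minus-strand track so that it plots below the axis; that would be a one-line change to the value column.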
The generated list of TSRs already has the information required to output a bed file for TSR visualization and downstream analysis (such as motif finding). It would be handy to have a function to generate it for you, instead of having to do it by hand every time. You could also perhaps include the score column to shade TSR "strength" in a genome viewer, such as scaling nTAGs between 0-1000.
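A possible sketch of such a function. tsrToBed is a hypothetical name, and the column names seq, start, end, strand and nTAGs are assumptions about the TSR table; start is assumed to be 1-based. The score here is nTAGs scaled relative to the strongest TSR, which is just one way to map onto 0-1000.

```r
# Hypothetical helper (not part of TSRchitect): write TSRs as a 6-column
# BED file with nTAGs scaled to a 0-1000 score.
tsrToBed <- function(tsr, outFile) {
  stopifnot(all(c("seq", "start", "end", "strand", "nTAGs") %in% names(tsr)))
  stopifnot(nrow(tsr) > 0)
  tsr <- tsr[order(tsr$seq, tsr$start), , drop = FALSE]
  bed <- data.frame(
    chrom  = tsr$seq,
    start  = as.integer(tsr$start) - 1L,  # assumes 1-based start positions
    end    = as.integer(tsr$end),
    name   = paste0("TSR_", seq_len(nrow(tsr))),
    # Strongest TSR gets 1000; the rest are scaled proportionally
    score  = as.integer(round(1000 * tsr$nTAGs / max(tsr$nTAGs))),
    strand = tsr$strand,
    stringsAsFactors = FALSE
  )
  write.table(bed, file = outFile, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(bed)
}

# Usage with toy data
tsrToy <- data.frame(seq = c("chr1", "chr1", "chr2"),
                     start = c(100L, 500L, 40L),
                     end = c(120L, 530L, 60L),
                     strand = c("+", "-", "+"),
                     nTAGs = c(50, 200, 100),
                     stringsAsFactors = FALSE)
bedPath <- file.path(tempdir(), "TSRs_scored_demo.bed")
bed <- tsrToBed(tsrToy, bedPath)
```

A log scale might shade better when a few TSRs dominate the tag counts, but linear scaling keeps the example simple.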
I've included an example from the list of TSRs Taylor generated for the latest STRIPE-seq run.
TSRs_scored.bed.zip
After testing demo-RAMPAGEp.R on the most recent codebase using the IRBB7 data, I found that the resulting tssExperiment object contains 3 (instead of the expected 2) slots in @tssCountDataMerged and @tsrDataMerged. I have not isolated the error yet, but am checking mergeSampleData.R first.
The latest TSR output has a few categories describing the attributes of a TSR. The following don't appear to have documentation explaining what they represent, and how they are derived.
The last accepted pull request led to a successful build of our Singularity container, but this was not the easiest of fixes. ngsutils has not been maintained for a few years (in its original package), and getting it to compile needed a peculiar fix: the code is compiled in a Python virtual environment, but as of their latest commit the required Cython is not installed into that environment, which causes fatal compilation errors.
I suggest: 1) freezing the current Singularity recipe/container with appropriate version labeling; and 2) considering building a new container with updated third-party software.
This report is for the latest GitHub version of TSRchitect.
processTSS seems to be using a large amount of memory. I had 6 fairly small BAM files loaded into a TSRchitect object. When I went to run processTSS using 4 cores I got the following error from the (carbonate) resource manager.
=>> PBS: job killed: vmem 101773303808 exceeded limit 34359738368
Error in serialize(data, node$con, xdr = FALSE) :
Java called System.exit(143) requesting R to quit - trying to recover
I had given myself 32GB of memory for my interactive session, but it looks like TSRchitect went up to about 100GB before I got booted.
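For reference, converting the two byte counts from the PBS message makes the gap easier to see. Nothing here is TSRchitect-specific; it is just the arithmetic.

```r
# Values copied from the PBS message above (bytes)
vmemUsed  <- 101773303808
vmemLimit <- 34359738368

bytesPerGiB <- 2^30
usedGiB  <- vmemUsed / bytesPerGiB
limitGiB <- vmemLimit / bytesPerGiB

cat(sprintf("used %.1f GiB, limit %.0f GiB, ratio %.2fx\n",
            usedGiB, limitGiB, usedGiB / limitGiB))
```

So the job was at roughly 95 GiB of virtual memory against a 32 GiB limit, about three times over.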
Bam file sizes
-rw-r--r-- 1 rpolicas biol 57M Mar 24 19:25 S288C_diamide_1_Aligned.out_cleaned.bam
-rw-r--r-- 1 rpolicas biol 59M Mar 24 19:25 S288C_diamide_2_Aligned.out_cleaned.bam
-rw-r--r-- 1 rpolicas biol 47M Mar 24 19:25 S288C_diamide_3_Aligned.out_cleaned.bam
-rw-r--r-- 1 rpolicas biol 76M Mar 24 19:25 S288C_WT_1_Aligned.out_cleaned.bam
-rw-r--r-- 1 rpolicas biol 86M Mar 24 19:25 S288C_WT_2_Aligned.out_cleaned.bam
-rw-r--r-- 1 rpolicas biol 76M Mar 24 19:25 S288C_WT_3_Aligned.out_cleaned.bam
Session Info
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.5 (Maipo)
Matrix products: default
BLAS/LAPACK: /gpfs/home/r/p/rpolicas/Carbonate/.conda/envs/tsrchitect-dev/lib/R/lib/libRblas.so
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] TSRchitect_1.8.9
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 compiler_3.5.1
[3] BiocManager_1.30.4 later_0.8.0
[5] GenomeInfoDb_1.18.1 XVector_0.22.0
[7] AnnotationHub_2.14.2 bitops_1.0-6
[9] tools_3.5.1 zlibbioc_1.28.0
[11] digest_0.6.18 bit_1.1-14
[13] lattice_0.20-38 RSQLite_2.1.1
[15] memoise_1.1.0 Matrix_1.2-16
[17] DelayedArray_0.8.0 shiny_1.2.0
[19] DBI_1.0.0 yaml_2.2.0
[21] parallel_3.5.1 rJava_0.9-10
[23] GenomeInfoDbData_1.2.0 rtracklayer_1.42.1
[25] httr_1.4.0 gtools_3.8.1
[27] XLConnectJars_0.2-15 Biostrings_2.50.2
[29] S4Vectors_0.20.1 IRanges_2.16.0
[31] grid_3.5.1 stats4_3.5.1
[33] bit64_0.9-7 Biobase_2.42.0
[35] R6_2.4.0 AnnotationDbi_1.44.0
[37] XML_3.98-1.19 BiocParallel_1.16.2
[39] blob_1.1.1 magrittr_1.5
[41] Rsamtools_1.34.0 matrixStats_0.54.0
[43] promises_1.0.1 htmltools_0.3.6
[45] BiocGenerics_0.28.0 GenomicRanges_1.34.0
[47] GenomicAlignments_1.18.1 XLConnect_0.2-15
[49] SummarizedExperiment_1.12.0 mime_0.6
[51] interactiveDisplayBase_1.20.0 xtable_1.8-3
[53] httpuv_1.5.0 RCurl_1.95-4.12
Right now you only allow a file input for the sample sheet. It would be useful to also allow a data frame as input. This would make it a lot easier to incorporate TSRchitect into magrittr pipes.
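A minimal sketch of how the input could be normalized, assuming the rest of the code only needs a data frame. readSampleSheet and the default column names (sample, replicate, file) are hypothetical placeholders, not the actual TSRchitect sample sheet format.

```r
# Hypothetical helper: accept either a path to a tab-delimited sample
# sheet or a data frame, and always return a data frame.
readSampleSheet <- function(sampleSheet,
                            required = c("sample", "replicate", "file")) {
  if (is.data.frame(sampleSheet)) {
    sheet <- sampleSheet
  } else if (is.character(sampleSheet) && length(sampleSheet) == 1L) {
    sheet <- read.delim(sampleSheet, stringsAsFactors = FALSE)
  } else {
    stop("sampleSheet must be a data frame or a single file path")
  }
  missingCols <- setdiff(required, names(sheet))
  if (length(missingCols) > 0) {
    stop("sample sheet is missing columns: ",
         paste(missingCols, collapse = ", "))
  }
  sheet
}

# Usage: the same sheet passed as a data frame and as a file
sheetDf <- data.frame(sample = c("WT", "WT"),
                      replicate = c(1L, 2L),
                      file = c("WT_1.bam", "WT_2.bam"),
                      stringsAsFactors = FALSE)
sheetPath <- file.path(tempdir(), "sample_sheet_demo.tsv")
write.table(sheetDf, sheetPath, sep = "\t", quote = FALSE, row.names = FALSE)

fromDf   <- readSampleSheet(sheetDf)
fromFile <- readSampleSheet(sheetPath)
```

With something like this at the top of the loading function, a data frame built earlier in a pipe could be passed straight in without writing it to disk first.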
At the moment, information about the experiments is entered manually into loadTSSobj. This can be somewhat inconvenient if you are processing a large number of files. Also, as stated in the documentation, you need to be careful to match the order with the BAM files.
A few packages let you load sample information from a separate tab-delimited file or data.frame object. This ensures that the sample information matches the associated BAM file, and makes it convenient to process a large number of files at the same time.
A good example of this is the DiffBind package.
TR to identify and fix this error, which is either in tsrToDF or just 'upstream'.