jlanga / exfi
Get exons from a transcriptome and raw genomic reads using abyss-bloom and bedtools
License: MIT License
import multiprocessing

# Oversubscribe workers so I/O-bound tasks overlap.
pool_size = multiprocessing.cpu_count() * 2
pool = multiprocessing.Pool(
    processes=pool_size,
    initializer=start_process,
    maxtasksperchild=2,  # recycle each worker after two tasks to cap memory growth
)
pool_outputs = pool.map(do_calculation, inputs)
pool.close()  # no more tasks
pool.join()   # wrap up current tasks
The polish step is taking too much memory per thread.
biobloommaker writes the files categories_multimatch.fa (empty), categories_noMatch.fa (huge), and categories_transcriptome.fa (not as huge, but still big).
These files can be compressed to .gz as they are produced by creating a FIFO at each path and running gzip on it in the background.
Or just compress them as soon as biobloommaker finishes.
The Bloom filter is written/copied to the .bf file, but the accompanying .txt info file that biobloommaker also outputs seems to be absent. The BioBloomTools documentation notes that the filter file is useless without the info file, so build_baited_bloom_filter should keep the info file around as well.
Bloom filters in this context are mostly zeros.
Using pigz -9 I was able to go from 30 GB to 70 MB on a filter with an FPR of 0.01%.
I think using gzip instead of pigz would make it even smaller.
Also worth trying: bgzip or xz.
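The point that mostly-zero filters compress extremely well can be sketched with a toy byte array (the sizes and bit spacing here are illustrative, not the real 30 GB filter):

```python
import gzip
import lzma

# A toy stand-in for a low-FPR Bloom filter: almost all zero bytes.
raw = bytearray(10_000_000)            # 10 MB of zeros
for i in range(0, len(raw), 100_003):  # sprinkle a few set bytes
    raw[i] = 0xFF

gz = gzip.compress(bytes(raw), compresslevel=9)
xz = lzma.compress(bytes(raw))         # xz usually shrinks sparse data further
print(len(raw), len(gz), len(xz))
```

Both compressors reduce the sparse array by orders of magnitude; xz tends to win on size at the cost of compression time, which matches the suggestion to benchmark gzip, bgzip, and xz against pigz.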
By default, the script adds only single reads to the Bloom filter.
We could add an option to include both reads of each pair.
exfi build_baited_bloom_filter : build the baited Bloom filter from the reads with biobloom|abyss
exfi build_splice_graph : build the splice graph (GFA1) from the transcriptome and the Bloom filter
exfi gfa1_to_exons : extract the predicted exons from the GFA1 file
exfi gfa1_to_gapped_transcripts : reconstruct gapped transcripts from the GFA1 file
Sealer can be used to try to fill small gaps of 1-5 nt.
To do that, I should make a script to …
The module is completely silent
The threading is poor
Instead of providing the Bloom filter path once, provide it twice: once for the initial exon prediction, and a second time (if necessary) for Sealer. This way we can pipe it from gzip and save space, and maybe time.
Sometimes exons are contained inside other exons (ISI, A5 and A3 splicing events).
Maybe the pairwise comparison does not need to run between all exons, only within each connected component.
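A sketch of restricting the containment check to connected components, assuming the splice graph is available as a list of edges and the exon sequences as a dict (both function names are hypothetical, not exfi's API):

```python
from itertools import combinations

def connected_components(edges):
    """Group the nodes of an undirected graph, given as edge pairs,
    into connected components using union-find."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

def contained_pairs(edges, sequences):
    """Find exon pairs where one sequence contains the other,
    comparing only exons that share a connected component."""
    pairs = []
    for component in connected_components(edges):
        for exon_a, exon_b in combinations(sorted(component), 2):
            seq_a, seq_b = sequences[exon_a], sequences[exon_b]
            if seq_a in seq_b or seq_b in seq_a:
                pairs.append((exon_a, exon_b))
    return pairs
```

Since exons from different transcript clusters can never be splice variants of each other, skipping cross-component comparisons turns one quadratic pass over all exons into many small quadratic passes.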
Currently, build_baited_bloom_filter does both the filter building and the read classification (despite its name). Even though filter building is quite fast, it would be useful to re-use one filter for multiple read datasets. Perhaps add an option to only build the filter, and another to specify a pre-existing filter?
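One way to expose this on the command line could look as follows (both flag names are hypothetical; neither exists in the current CLI):

```python
import argparse

# Hypothetical flags -- illustrating the proposed split, not the real interface.
parser = argparse.ArgumentParser(prog="build_baited_bloom_filter")
parser.add_argument("--only-build", action="store_true",
                    help="build the Bloom filter, then exit without classifying reads")
parser.add_argument("--use-filter", metavar="FILTER_BF",
                    help="classify reads against this pre-existing filter instead of building one")

args = parser.parse_args(["--use-filter", "genome_k25_m500M_l1.bf"])
print(args.only_build, args.use_filter)
```

With --only-build, one expensive filter could be built once and then reused via --use-filter across every read dataset.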
In the README, under "Running the pipeline":

Step 2: build_splicegraph should be build_splice_graph, and genome_k25_m100M_l1.bloom should be genome_k25_m500M_l1.bf.

Step 3: neither gfa_to_exons nor gfa_to_gapped_transcript exists. Instead, gfa1_to_fasta is there, and it seems to have options that reproduce the behaviour of the other two tools. Amend the instructions to reflect this.
We see that 2-4 GB of memory are enough to process …
Read the transcriptome lazily, when necessary, instead of keeping the whole file in memory for the entire pipeline.
collapse merges exons by exact identity.
A more successful approach could be to map every exon against every other exon, get the blocks of matches, and recompose the graph.
Yielding whole chunks instead of single lines should be faster.
Matches between consecutive exons follow a Poisson distribution. In the merge2 step, we merge exons that overlap by 8 or more consecutive nucleotides. This happens rarely, but it happens.
Bloom filters are known to produce false positives. In this context, the reported exons may carry one or two spurious bases at their start or end. We could therefore give the script an option to trim 1 or 2 nucleotides from each end of every exon.
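A sketch of such an optional trimming step (the function name and its default are hypothetical):

```python
def trim_exons(exons, bases=2):
    """Trim `bases` nucleotides from each end of every exon sequence,
    dropping bases that may stem from Bloom filter false positives.

    Exons of length <= 2 * bases are returned untouched, so nothing
    collapses to an empty sequence.
    """
    trimmed = {}
    for name, seq in exons.items():
        if len(seq) > 2 * bases:
            trimmed[name] = seq[bases:len(seq) - bases]
        else:
            trimmed[name] = seq
    return trimmed
```

Exposed behind a flag, this would let users trade one or two real bases per exon boundary for removing the false-positive overhangs.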