
exfi's Issues

Follow 10 simple rules for making research software more robust

  1. Use version control.
  • [ ] Put everything that you write or make into version control as soon as it is created.
  • [ ] Use a feature branch workflow.
  2. Document your code and usage.
  • [ ] Write a good README file.
  • [ ] Print usage information.
  3. Make common operations easy to control.
  • [ ] Allow the most commonly changed parameters to be configured from the command line.
  • [ ] Check that all input values are in a reasonable range at startup.
  • [ ] Choose reasonable defaults where they exist.
  • [ ] Set no defaults at all when there aren’t any reasonable ones.
  4. Version your releases.
  • [ ] Increment your version number every time you release your software to other people.
  • [ ] Make the version of your software easily available by supplying --version or -v on the command line.
  • [ ] Include the version number in all of the program’s output.
  • [ ] Ensure that old released versions continue to be available.
  5. Reuse software (within reason).
  • [ ] Make sure that you really need the auxiliary program.
  • [ ] Ensure the appropriate software and version is available.
  • [ ] Ensure that reused software is robust.
  6. Rely on build tools and package managers for installation.
  • [ ] Document all dependencies in a machine-readable form.
  • [ ] Avoid depending on scripts and tools which are not available as packages.
  7. Do not require root or other special privileges to install or run.
  • [ ] Do not require root privileges to set up or use packages.
  • [ ] Allow packages to be installed in an arbitrary location.
  8. Ask another person to try and build your software before releasing it.
  • [ ] Eliminate hard-coded paths.
  • [ ] Set the names and locations of input and output files as command-line parameters.
  9. Do not require users to navigate to a particular directory to do their work.
  • [ ] Include a small test set that can be run to ensure the software is actually working.
  • [ ] Make the tests easy to find and run.
  • [ ] Make the test script’s output easy to interpret.
  10. Produce identical results when given identical inputs.
  • [ ] Echo all parameters and software versions to standard out or a log file alongside the results.
  • [ ] Produce the same results each time the same version of the program is run with the same inputs.
  • [ ] Allow the user to optionally provide the random seed as an input parameter.
  • [ ] Make sure acceptable tolerances are known and detailed in documentation and tests.

Compress biobloommaker files as they come out

biobloommaker returns the files categories_multimatch.fa (empty), categories_noMatch.fa (huge), and categories_transcriptome.fa (not so huge, but still big).

These files can be compressed to .gz as they come out by creating a FIFO at each path and running gzip in the background.

Or just compress them as soon as biobloommaker finishes.
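
A minimal sketch of the FIFO approach, using the file names quoted above and leaving the actual biobloommaker invocation as a placeholder:

```bash
# Sketch: pre-create named pipes where biobloommaker will write, with a
# background gzip draining each one as the data arrives.
for f in categories_multimatch.fa categories_noMatch.fa categories_transcriptome.fa; do
    mkfifo "$f"
    gzip -9 < "$f" > "$f.gz" &
done
# ... run biobloommaker here so it writes into the pipes ...
wait                   # let the background gzip processes finish
rm -f categories_*.fa  # the pipes themselves are no longer needed
```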

Compress the resulting Bloom filters

Bloom filters in this context are mostly zeros.

Using pigz -9 I was able to go from 30 GB to 70 MB on a Bloom filter with an FPR of 0.01%.

I think using gzip instead of pigz might make it even smaller.

Also: try bgzip or xz.
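
All the candidates can be compared side by side; a sketch, with filter.bf as a placeholder path:

```bash
# Sketch: compress one Bloom filter with each candidate and compare sizes.
pigz  -9 -c filter.bf > filter.bf.pigz.gz   # parallel gzip
gzip  -9 -c filter.bf > filter.bf.gzip.gz   # single-threaded, possibly a bit smaller
xz    -9 -c filter.bf > filter.bf.xz        # usually the best ratio, but slow
bgzip -c    filter.bf > filter.bf.bgz       # blocked gzip; keeps random access possible
ls -lh filter.bf*                           # eyeball the resulting sizes
```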

Make exfi a single tool

exfi build_baited_bloom_filter: build the baited Bloom filter from biobloom|abyss
exfi build_splice_graph: build the splice graph in GFA1 format
exfi gfa1_to_exons: extract the predicted exons from the GFA1 file
exfi gfa1_to_gapped_transcripts: rebuild the gapped transcripts from the GFA1 file
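
One low-effort way to get there is a thin dispatcher around the existing entry points; a sketch, assuming each step is already installed as a standalone executable under these exact names (that naming is an assumption):

```bash
#!/usr/bin/env bash
# Sketch of a single "exfi" command dispatching to the per-step tools.
set -euo pipefail
cmd="${1:-}"; shift || true
case "$cmd" in
    build_baited_bloom_filter|build_splice_graph|gfa1_to_exons|gfa1_to_gapped_transcripts)
        "$cmd" "$@" ;;   # assumes each step is on the PATH under its own name
    *)
        echo "usage: exfi <subcommand> [options]" >&2
        exit 1 ;;
esac
```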

Use Sealer to fill gaps left by SNPs

Sealer can be used to try to fill small gaps of 1-5 nt.

To do that, I should make a script to

  1. paste exons with Ns,
  2. run sealer and
  3. split again into exons (see the sketch after this list).
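
A rough sketch of those three steps; paste_exons_with_Ns and split_on_Ns are hypothetical helpers, and the abyss-sealer flags should be double-checked against its documentation:

```bash
# Sketch: fill 1-5 nt gaps between predicted exons with abyss-sealer.
paste_exons_with_Ns exons.fa > scaffolds.fa       # 1. join exons with runs of Ns (hypothetical helper)
abyss-sealer -b2G -k27 -o sealed -S scaffolds.fa \
    reads_1.fq.gz reads_2.fq.gz                   # 2. try to fill the gaps
split_on_Ns sealed_scaffold.fa > exons_filled.fa  # 3. split back into exons (hypothetical helper)
```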

Read Bloom filters twice in build_splice_graph

Instead of providing the Bloom filter path once, provide it twice: once for the initial exon prediction and a second time (if necessary) for Sealer. This way, each consumer can read it from a gzip pipe, saving disk space and maybe time.
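
A sketch of the resulting invocation using bash process substitution; both flag names are hypothetical, not existing options:

```bash
# Sketch: hand the same gzipped filter to both consumers without ever
# materialising the uncompressed copy on disk. Flag names are made up.
build_splice_graph \
    --bloom <(gzip -dc genome_k25.bf.gz) \
    --sealer-bloom <(gzip -dc genome_k25.bf.gz) \
    transcriptome.fa > splice_graph.gfa
```

Note that process substitution hands each reader a pipe, so this only works if the filter is read sequentially rather than memory-mapped.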

Separate filter building and read classification

Currently, build_baited_bloom_filter does both the filter building and the read classification (despite its name). Even though filter building is quite fast, it would be prudent to be able to reuse one filter for multiple read datasets. Perhaps add an option to only build the filter, and allow a pre-existing filter to be specified?
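
The proposed interface could look something like this (subcommand and flag names are hypothetical):

```bash
# Sketch: build the filter once, classify any number of read sets against it.
exfi build_filter --kmer 25 --output genome_k25.bf genome.fa
exfi classify --filter genome_k25.bf sample_a_1.fq sample_a_2.fq
exfi classify --filter genome_k25.bf sample_b_1.fq sample_b_2.fq
```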

Some file and tool names in README are wrong

In the README under "Running the pipeline":

Step 2:

  • build_splicegraph should be build_splice_graph
  • genome_k25_m100M_l1.bloom should be genome_k25_m500M_l1.bf

Step 3:
Neither gfa_to_exons nor gfa_to_gapped_transcript exists. Instead, gfa1_to_fasta is there, and seems to have options that reproduce the behaviour of the two missing tools. Amend the instructions to reflect this.

build_splice_graph

Read the transcriptome only when necessary, instead of keeping the whole file in memory for the entire pipeline.

Improve the collapse function

collapse merges exons by exact identity.

A more successful approach could be to map every exon against every other exon, get the blocks of matches, and recompose the graph.
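
A sketch of the mapping step; using minimap2 for the all-vs-all comparison is an assumption, not something the pipeline does today:

```bash
# Sketch: all-vs-all alignment of the exons. -X skips self and dual hits,
# -c emits CIGAR strings so the match blocks can be read from the PAF file.
minimap2 -c -X exons.fa exons.fa > exon_overlaps.paf
# Downstream (not shown): parse the PAF match blocks and merge the
# corresponding nodes in the splice graph.
```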

Split-apply-combine

  • Make a small filter for the complete transcriptome
  • Split the transcriptome into subtranscriptomes (~1000)
  • Run build_splice_graph in parallel, without collapse
  • Merge the GFA1 files, collapsing optionally (see the sketch below)
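
A sketch with seqkit and GNU parallel; both tool choices and the --no-collapse, --input and --output flags are assumptions:

```bash
# Sketch: split, build in parallel, merge, then collapse once at the end.
seqkit split -s 1000 -O chunks transcriptome.fa    # ~1000 sequences per chunk
parallel 'build_splice_graph --no-collapse --input {} --output {.}.gfa' ::: chunks/*.fa
cat chunks/*.gfa > splice_graph.gfa                # merge; collapse afterwards if wanted
```

Note that naive concatenation duplicates the GFA1 header lines, so the merge step would need a small cleanup.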

Add options to trim sequences

Bloom filters are known for having false positives. In this context, the reported exons may have one or two additional bases at the start or the end. Therefore, we could give the script an option to trim 1 or 2 nucleotides from each end of every exon.
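
Until such an option exists, the trimming can be done externally; a sketch with seqtk (the tool choice is an assumption):

```bash
# Sketch: cut 2 bases from both ends of every reported exon.
# seqtk's trimfq also accepts FASTA input despite its name.
seqtk trimfq -b 2 -e 2 exons.fa > exons_trimmed.fa
```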
