Giter VIP home page Giter VIP logo

exfi's Introduction

Build Status Codacy Badge Codacy Badge

exfi

Get exons from a transcriptome and raw genomic reads using abyss-bloom and bedtools

Citation

exfi was published on Ecology and Evolution

Langa J, Estonba A, Conklin D. EXFI: Exon and splice graph prediction without a reference genome. Ecol Evol. 2020;00:1–14. https://doi.org/10.1002/ece3.6587

Requirements

Docker or different apt and conda packages (see installation guide).

How to install

To install other dependencies, follow the instructions from the travis files:

  1. Install packages with apt:
sudo apt install \
    autoconf build-essential bzip2 cmake curl gcc git libboost-dev libsdsl3 \
    libz-dev zlib1g
  1. Install conda, then configure channels and install
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda
conda install --yes abyss=2.0.1 bedtools biopython pandas pip
  1. Install biobloomtools

You may need to use sudo in the last command:

# Install biobloomtools
git clone --recursive https://github.com/bcgsc/biobloom.git && \
cd biobloom/ && \
git submodule update --init && \
git checkout 0a42916922d42611a087d4df871e424a8907896e && \
./autogen.sh && \
./configure --prefix=/usr/local/ && \
make -j 4 && \
make install
  1. Copy this repo and install it with pip:
pip install https://github.com/jlanga/exfi/archive/v1.5.6.zip

If you have access to Docker, you can create a Debian container with the following command:

docker build --rm --tag exfi:v1.5.6 github.com/jlanga/exfi-docker

More info

Required data

  • A transcriptome in fasta format (take it from Ensembl for example, or the result of a de novo transcriptome assembler like trinity, trans-abyss or oases)

  • A set of genomic reads in fastq format, paired end or not. .gz files are allowed.

Running the pipeline

  1. Make a baited Bloom filter of the genomic reads with build_baited_bloom_filter:
  • genome.fa.gz is the set of genomic reads and
  • genome_k25_m100M_l1.bf is the resulting Bloom filter, made of kmers of length 25, a size of 100 MB and the number of times of a kmer must be in the reads is 1 (levels).
# Assuming that you are in the exfi folder:
build_baited_bloom_filter \
    --input-fasta data/transcript.fa \
    --kmer 25 \
    --bloom-size 100M \
    --levels 1 \
    --threads 4 \
    --output-bloom genome_k25_m100M_l1.bf \
    data/genome.fa.gz
  1. Run build_splice_graph to get putative exons in the transcriptome.
  • data/transcript.fa is the input transcriptome,
  • genome_k25_m500M_l1.bf is the Bloom filter generated above
  • kmer length has to be the same
  • test.gfa is the resulting splice graph in GFA1 format.
build_splicegraph \
    --input-fasta data/transcript.fa \
    --input-bloom genome_k25_m100M_l1.bloom \
    --kmer 25 \
    --max-fp-bases 5 \
    --output-gfa test.gfa

This splice graph can be visualized with Bandage

Example:

H	VN:Z:1.0
S	ENSDART00000033574.5:0-216	GTAAGCCGCGGCGGTGTGTGTGTGTGTGTGTGTTCTCCGTCATCTGTGTTCTGCTGAATGATGAGGACAGACGTGTTTCTCCAGCGGAGGAAGCGTAGAGATGTTCTGCTCTCCATCATCGCTCTTCTTCTGCTCATCTTCGCCATCGTTCATCTCGTCTTCTGCGCTGGACTGAGTTTCCAGGGTTCGAGTTCTGCTCGCGTCCGCCGAGACCTC
S	ENSDART00000033574.5:216-398	GAGAATGCGAGTGAGTGTGTGCAGCCACAGTCGTCTGAGTTTCCTGAAGGATTCTTCACGGTGCAGGAGAGGAAAGATGGAGGAATCCTGATTTACTTCATGATCATCTTCTACATGCTGCTGTCCGTCTCCATCGTGTGTGATGAATATTTTCTGCCATCTCTGGAGGTCATCAGCGAGCG
S	ENSDART00000033574.5:397-482	GTCTTGGTCTCTCGCAGGATGTTGCTGGAGCCACGTTTATGGCTGCGGGGAGTTCGGCTCCAGAGCTCGTCACTGCATTTCTGGG
S	ENSDART00000033574.5:480-586	GGTGTGTTTGTGACGAAGGGCGACATCGGCGTCAGCACCATCATGGGTTCTGCTGTCTATAACCTGCTGTGCATCTGTGCAGCGTGCGGCCTGCTGTCCTCTGCAG
S	ENSDART00000033574.5:585-687	GTTGGTCGTCTGAGCTGCTGGCCGTTGTTCAGAGATTGTGTTGCGTACTCCATCAGTGTCGCCGCCGTCATCGCCATCATCTCAGATAACAGAGTTTACTGG
S	ENSDART00000033574.5:685-969	GGTATGATGGCGCGTGTCTCCTGCTGGTGTACGGTGTGTATGTAGCTGTACTGTGTTTCGATCTGAAGATCAGCGAGTACGTGATGCAGCGCTTCAGTCCATGCTGCTGGTGTCTGAAACCTCGCGATCGTGACTCAGGCGAGCAGCAGCCTCTAGTGGGCTGGAGTGACGACAGCAGCCTGCGGGTCCAGCGCCGTTCCAGAAATGACAGCGGAATATTCCAGGATGATTCTGGATATTCACATCTATCGCTCAGCCTGCACGGACTCAACGAAATCAGCGAC
S	ENSDART00000033574.5:969-1177	GAGCACAAGAGTGTGTTCTCCATGCCGGATCACGATCTGAAGCGAATCCTGTGGGTTTTGTCTCTTCCGGTCAGCACTCTGCTGTTTGTGAGCGTTCCCGACTGCAGGAGACCCTTCTGGAAGAACTTCTACATGCTGACCTTCCTGATGTCCGCCGTCTGGATTTCTGCATTCACTTATGTGCTGGTCTGGATGGTCACAATCGTGG
S	ENSDART00000033574.5:1176-1283	GGGGAGACTCTGGGAATCCCGGACACAGTGATGGGAATGACTCTTCTGGCTGCAGGAACCAGTATCCCCGACACCGTGGCCAGTGTGATGGTGGCACGAGAAGGTAA
S	ENSDART00000033574.5:1277-2002	AGGTAAATCTGATATGGCCATGTCCAACATCGTGGGCTCTAACGTGTTCGATATGCTGTGTCTGGGCCTGCCGTGGTTCATCCAGACGGTGTTTGTTGACGTGGGCTCCCCGGTGGATGTCAACAGCTCGGGGCTGGTCTTCATGTCCTGCACGCTGCTGCTCTCCATCATCTTCCTCTTCCTCGCCGTGCACATCAACGGCTGGAAGCTGGACTGGAAGCTGGGTCTGGTGTGTTTGGCGTGTTACATTCTGTTCGCAACACTCTCCATCCTGTACGAGCTCGGCATCATCGGGAACAATCCCATACGCTCCTGCAGCGACTGAACACTGCTCTACAGCGCCCCCTTATGGACAACACAAGGACGTGACTCTTTATAACCCTCTAAAGTGCACAGGTTCATTACTGAATACAAGAAAATAGAACTGCGAGACGTCAACTCAAAATACAAGAGAAGTCAAAGTGCGAGATGTAAAAAATATATGCACATAAATGAGGATAAACTTTTTATTTAATAAGACAAAACTGCATAAAGTCTGATGTGAACACTGCTCAACAGCGCCCTCTCATGGACAACACATGGATCTGACTCTTATTAACCCTCCAGAGTGCAAATACACTAACACAACGTAATATAACCAAGTTAAAATGGCAAGATGTGAACTCAAAATACAAGAAAGCAGTCAAGATGCCCGACATAACAAATGTGCATTAAAATGTAAGCCC
L	ENSDART00000033574.5:0-216	+	ENSDART00000033574.5:216-398	+	0M
L	ENSDART00000033574.5:216-398	+	ENSDART00000033574.5:397-482	+	1M
L	ENSDART00000033574.5:397-482	+	ENSDART00000033574.5:480-586	+	2M
L	ENSDART00000033574.5:480-586	+	ENSDART00000033574.5:585-687	+	1M
L	ENSDART00000033574.5:585-687	+	ENSDART00000033574.5:685-969	+	2M
L	ENSDART00000033574.5:685-969	+	ENSDART00000033574.5:969-1177	+	0M
L	ENSDART00000033574.5:969-1177	+	ENSDART00000033574.5:1176-1283	+	1M
L	ENSDART00000033574.5:1176-1283	+	ENSDART00000033574.5:1277-2002	+	6M
C	ENSDART00000033574.5	+	ENSDART00000033574.5:0-216	+	0	216M
C	ENSDART00000033574.5	+	ENSDART00000033574.5:216-398	+	216	182M
C	ENSDART00000033574.5	+	ENSDART00000033574.5:397-482	+	397	85M
C	ENSDART00000033574.5	+	ENSDART00000033574.5:480-586	+	480	106M
C	ENSDART00000033574.5	+	ENSDART00000033574.5:585-687	+	585	102M
C	ENSDART00000033574.5	+	ENSDART00000033574.5:685-969	+	685	284M
C	ENSDART00000033574.5	+	ENSDART00000033574.5:969-1177	+	969	208M
C	ENSDART00000033574.5	+	ENSDART00000033574.5:1176-1283	+	1176	107M
C	ENSDART00000033574.5	+	ENSDART00000033574.5:1277-2002	+	1277	725M
P	ENSDART00000033574.5	ENSDART00000033574.5:0-216+,ENSDART00000033574.5:216-398+,ENSDART00000033574.5:397-482+,ENSDART00000033574.5:480-586+,ENSDART00000033574.5:585-687+,ENSDART00000033574.5:685-969+,ENSDART00000033574.5:969-1177+,ENSDART00000033574.5:1176-1283+,ENSDART00000033574.5:1277-2002+	*
  1. Get exonic sequences

To extract meaningful information from the GFA file, we provide two scripts:

  • gfa_to_exons: which returns the predicted exons in FASTA format. For each record, each sequence comes with a unique identifier (EXON[0-9]+), a description indicating the coordinates of this exon with respect to the different transcripts (TR1:0-200 TR2:105-305), and the sequence of nucleotides. It is possible to hard and soft mask nucleotides that may not be correct. Example (soft masked):
>EXON00000000003 ENSDART00000033574.5:397-482
gTCTTGGTCTCTCGCAGGATGTTGCTGGAGCCACGTTTATGGCTGCGGGGAGTTCGGCTC
CAGAGCTCGTCACTGCATTTCTGgg
>EXON00000000004 ENSDART00000033574.5:480-586
ggTGTGTTTGTGACGAAGGGCGACATCGGCGTCAGCACCATCATGGGTTCTGCTGTCTAT
AACCTGCTGTGCATCTGTGCAGCGTGCGGCCTGCTGTCCTCTGCAg
  • gfa_to_gapped_transcript: which returns the transcript with interleaved Ns where it is predicted to be an intron. Example (hard masked):
>ENSDART00000033574.5 EXON00000000001,EXON00000000002,EXON00000000003,EXON00000000004,EXON00000000005,EXON00000000006,EXON00000000007,EXON00000000008,EXON00000000009
GTAAGCCGCGGCGGTGTGTGTGTGTGTGTGTGTTCTCCGTCATCTGTGTTCTGCTGAATG
ATGAGGACAGACGTGTTTCTCCAGCGGAGGAAGCGTAGAGATGTTCTGCTCTCCATCATC
GCTCTTCTTCTGCTCATCTTCGCCATCGTTCATCTCGTCTTCTGCGCTGGACTGAGTTTC
CAGGGTTCGAGTTCTGCTCGCGTCCGCCGAGACCTCNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNGAGAATGCGAGTGAGTGTGTGCAGCCACAGTCGTCTGAGTTTCC
TGAAGGATTCTTCACGGTGCAGGAGAGGAAAGATGGAGGAATCCTGATTTACTTCATGAT
CATCTTCTACATGCTGCTGTCCGTCTCCATCGTGTGTGATGAATATTTTCTGCCATCTCT
GGAGGTCATCAGCGAGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNT
CTTGGTCTCTCGCAGGATGTTGCTGGAGCCACGTTTATGGCTGCGGGGAGTTCGGCTCCA
GAGCTCGTCACTGCATTTCTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNTGTGTTTGTGACGAAGGGCGACATCGGCGTCAGCACCATCATGGGTTCTGCTGTC
TATAACCTGCTGTGCATCTGTGCAGCGTGCGGCCTGCTGTCCTCTGCANNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTGGTCGTCTGAGCTGCTGGCCGTTGTTCA
GAGATTGTGTTGCGTACTCCATCAGTGTCGCCGCCGTCATCGCCATCATCTCAGATAACA
GAGTTTACTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTATGATG
GCGCGTGTCTCCTGCTGGTGTACGGTGTGTATGTAGCTGTACTGTGTTTCGATCTGAAGA
TCAGCGAGTACGTGATGCAGCGCTTCAGTCCATGCTGCTGGTGTCTGAAACCTCGCGATC
GTGACTCAGGCGAGCAGCAGCCTCTAGTGGGCTGGAGTGACGACAGCAGCCTGCGGGTCC
AGCGCCGTTCCAGAAATGACAGCGGAATATTCCAGGATGATTCTGGATATTCACATCTAT
CGCTCAGCCTGCACGGACTCAACGAAATCAGCGACNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNGAGCACAAGAGTGTGTTCTCCATGCCGGATCACGATCTGAAGCGA
ATCCTGTGGGTTTTGTCTCTTCCGGTCAGCACTCTGCTGTTTGTGAGCGTTCCCGACTGC
AGGAGACCCTTCTGGAAGAACTTCTACATGCTGACCTTCCTGATGTCCGCCGTCTGGATT
TCTGCATTCACTTATGTGCTGGTCTGGATGGTCACAATCGTGNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNGGGAGACTCTGGGAATCCCGGACACAGTGATGGGAA
TGACTCTTCTGGCTGCAGGAACCAGTATCCCCGACACCGTGGCCAGTGTGATGGTGGCAC
GAGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATCT
GATATGGCCATGTCCAACATCGTGGGCTCTAACGTGTTCGATATGCTGTGTCTGGGCCTG
CCGTGGTTCATCCAGACGGTGTTTGTTGACGTGGGCTCCCCGGTGGATGTCAACAGCTCG
GGGCTGGTCTTCATGTCCTGCACGCTGCTGCTCTCCATCATCTTCCTCTTCCTCGCCGTG
CACATCAACGGCTGGAAGCTGGACTGGAAGCTGGGTCTGGTGTGTTTGGCGTGTTACATT
CTGTTCGCAACACTCTCCATCCTGTACGAGCTCGGCATCATCGGGAACAATCCCATACGC
TCCTGCAGCGACTGAACACTGCTCTACAGCGCCCCCTTATGGACAACACAAGGACGTGAC
TCTTTATAACCCTCTAAAGTGCACAGGTTCATTACTGAATACAAGAAAATAGAACTGCGA
GACGTCAACTCAAAATACAAGAGAAGTCAAAGTGCGAGATGTAAAAAATATATGCACATA
AATGAGGATAAACTTTTTATTTAATAAGACAAAACTGCATAAAGTCTGATGTGAACACTG
CTCAACAGCGCCCTCTCATGGACAACACATGGATCTGACTCTTATTAACCCTCCAGAGTG
CAAATACACTAACACAACGTAATATAACCAAGTTAAAATGGCAAGATGTGAACTCAAAAT
ACAAGAAAGCAGTCAAGATGCCCGACATAACAAATGTGCATTAAAATGTAAGCCC

Attributions

Written by Jorge Langa

Bibliography

exfi's People

Contributors

jlanga avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

exfi's Issues

use sealer to fill gaps left by SNPs

Sealer can be used to try to fill small gaps of 1-5 nt.

To do that, I should make a script to

  1. paste exons with Ns,
  2. run sealer and
  3. split again into exons.

make exfi a single tool

exfi build_baited_bloom_filter : build the baited bloom filter from biobloom|abyss
exfi build_splice_graph :
exfi gfa1_to_exons :
exfi gfa1_to_gapped_transcripts :

Compress biobloommaker files as they come out

biobloommaker returns the files categories_multimatch.fa (empty), categories_noMatch.fa (huge) and categories_transcriptome.fa (not so huge, but big).

These files can be compressed to .gz as they come out by creating a FIFO in their path, and setting gzip in the background.

Or just compress as soon as biobloommaker finishes.

Separate filter building and read classification

Currently, build_baited_bloom_filter does both the filter building and read classification (despite its name). Even though filter building is quite fast, it would be prudent to be able to re-use one filter for multiple read datasets. Perhaps add an option to only build the filter, and allow to specify a pre-existing filter?

build_splice_graph

read transcriptome when necessary, instead of storing the file the entire pipeline

Compress the resulting Bloom filters

Bloom filters in this context are mostly zeros.

Using pigz -9 I was able to go from 30Gb to 70Mb in a BF with FPR of 0.01%.

I think also usign gzip instead of pigz will make it even smaller.

Also: try bgzip or xz

split-apply-combine

  • Make a small filter for the complete transcriptome
  • split transcriptome into subtranscriptomes (~ 1000)
  • run build_splice_graph in parallel, without collapse
  • merge GFA1, collapse optionally

Improve the collapse function

collapse merges exons by exact identity.

A more successful aproach could be to map the exons to every exon, get blocks of mathces and recompose the graph.

Follow 10 simple rules for making reasearch software more robust

  1. Use version control.
  • Put everything that you write or make into version control as soon as it is created.
  • Use a feature branch workflow.
  1. Document your code and usage.
  • [] Write a good README file.
  • [] Print usage information.
  1. Make common operations easy to control.
  • [] Allow the most commonly changed parameters to be configured from the command line.
  • [] Check that all input values are in a reasonable range at startup.
  • [] Choose reasonable defaults where they exist.
  • [] Set no defaults at all when there aren’t any reasonable ones.
  1. Version your releases.
  • [] Increment your version number every time you release your software to other people.
  • [] Make the version of your software easily available by supplying --version or -v on the command line.
  • [] Include the version number in in all of the program’s output.
  • [] Ensure that old released versions continue to be available.
  1. Reuse software (within reason).
  • [] Make sure that you really need the auxiliary program.
  • [] Ensure the appropriate software and version is available.
  • [] Ensure that reused software is robust.
  1. Rely on build tools and package managers for installation.
  • [] Document all dependencies in a machine-readable form.
  • [] Avoid depending on scripts and tools which are not available as packages.
  1. Do not require root or other special privileges to install or run.
  • [] Do not require root privileges to set up or use packages.
  • [] Allow packages to be installed in an arbitrary location.
  1. Ask another person to try and build your software before releasing it.
  • [] Eliminate hard-coded paths.
  • [] Set the names and locations of input and output files as command-line parameters.
  1. Do not require users to navigate to a particular directory to do their work.
  • [] Include a small test set that can be run to ensure the software is actually working.
  • [] Make the tests easy to find and run.
  • [] Make the test script’s output easy to interpret.
  1. Produce identical results when given identical inputs.
  • [] Echo all parameters and software versions to standard out or a log file alongside the results.
  • [] Produce the same results each time the same version of the program is run with the same inputs.
  • [] Allow the user to optionally provide the random seed as an input parameter.
  • [] Make sure acceptable tolerances are known and detailed in documentation and tests.

add options to trim sequences

Bloom filters are known for having false positives. In this context, the reported exons may have one or two additional bases at the start or the end of the exon. Therefore, we could give the option to the script to trim 1 or 2 nucleotides to each exon.

read bloom filters twice in build_splice_graph

Instead of providing once the bloom filter path, provide it twice, one for initial exon prediction, and a second one (if necessary) for sealer. This way, we can pipe it from gzip and save space, and maybe time.

Some file and tool names in README are wrong

In the README under "Running the pipeline":

2.:

  • build_splicegraph should be build_splice_graph
  • genome_k25_m100M_l1.bloom should be genome_k25_m500M_l1.bf

3.:
Neither gfa_to_exons nor gfa_to_gapped_transcript exist. Instead, gfa1_to_fasta is there and seems to have options that modify its behaviour according to the other two tools. Amend the instructions to reflect this.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.