
virus_detection_sra's Introduction


SIDEARM - Your weapon for viral discovery in the NCBI SRA database

Sidearm searches the SRA database for viruses using the NCBI magicBLAST tool. It generates a table that reports the number of alignments to each virus along with metrics such as sequence coverage and average depth. Reads that align to a virus are then assembled into contigs in an attempt to reconstruct complete viral genomes.

Installation

Clone the repository

git clone https://github.com/NCBI-Hackathons/Virus_Detection_SRA

Dependencies

Install the following:

Example Workflow

This example uses an RNA-seq dataset (SRR1553459) from an Ebola virus outbreak. After cloning this repository, do the following:

cd Virus_Detection_SRA/cwl/tools
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/848/505/GCF_000848505.1_ViralProj14703/GCF_000848505.1_ViralProj14703_genomic.fna.gz
gunzip GCF_000848505.1_ViralProj14703_genomic.fna.gz
makeblastdb -dbtype nucl -in GCF_000848505.1_ViralProj14703_genomic.fna -out ebolazaire -parse_seqids
export BLASTDB=$BLASTDB:`pwd`

These steps download the Ebola virus genome and uncompress it, build a BLAST database from the genome with makeblastdb, and add your current directory to the BLASTDB environment variable.

sidearm.cwl sidearm.SRR1553459.ebola.yml

This step runs Sidearm and generates the following primary output files:

  • SRR1553459.ebolazaire.bam - magicblast alignments to Ebola virus (sorted BAM file)
  • SRR1553459.ebolazaire.bam.summarize.tsv - aggregate information about the alignments to each reference sequence (in this case it is only Ebola virus)
  • SRR1553459.ebolazaire.bam.fa.trim.fa.assembly.fa - the contigs generated by the assembly of all sequences that aligned to the reference sequences.

The log files for the trim and assembly modules are also created:

  • SRR1553459.ebolazaire.bam.fa.trim.log - the trim module logfile
  • SRR1553459.ebolazaire.bam.fa.trim.fa.assembly.log - the assembly module logfile

Expected Results

Open the summarize.tsv file in a spreadsheet program. The number of alignments to Ebola virus should be ~15,000 (column 'aligns'), sequence coverage ~98% (column 'seqcov'), and average depth ~75 (column 'avgdepth'). The longest contig is ~12,500bp and a BLASTN search shows that it is Ebola virus.
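
If you prefer the command line to a spreadsheet, the same columns can be pulled out with awk. This is a minimal sketch, assuming summarize.tsv is tab-separated and its first line is a header containing the column names shown in this README; show_summary is a hypothetical helper name and is not part of Sidearm.

```shell
# Hypothetical helper (not part of Sidearm): print id, aligns, seqcov and
# avgdepth for every row of a summarize.tsv file.
# Assumes a tab-separated file whose first line is a header row.
show_summary() {
    awk -F'\t' '
        NR == 1 {
            # Map each header name to its column number.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            printf "%s\t%s\t%s\t%s\n", $(col["id"]), $(col["aligns"]), $(col["seqcov"]), $(col["avgdepth"])
        }
    ' "$1"
}

# Usage:
#   show_summary SRR1553459.ebolazaire.bam.summarize.tsv
```

Looking columns up by header name rather than by position keeps the sketch working if the column order differs between Sidearm versions.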

The avg (average) fields are the mean MAPQ, Score, or EditDist across all alignments to each subject (reference sequence) in the BAM file. MAPQ is taken from field 5 of each alignment record, Score from the AS tag, and EditDist from the NM tag.
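
To make these definitions concrete, the sketch below recomputes the three averages from SAM text with awk. It is an illustration only, not the code in summarize_bam_by_ref.pl; avg_by_ref is a hypothetical name, and the sketch assumes samtools is available to convert the BAM file to SAM records.

```shell
# Hypothetical helper (not Sidearm's implementation): read SAM records on
# stdin and print, for each reference, the mean MAPQ (field 5), the mean AS
# tag and the mean NM tag. A record without an AS or NM tag contributes 0.
avg_by_ref() {
    awk -F'\t' '
        /^@/      { next }   # skip SAM header lines
        $3 == "*" { next }   # skip unaligned reads
        {
            ref = $3
            n[ref]++
            mapq[ref] += $5
            # Optional tags start at field 12 and may appear in any order.
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^AS:i:/) score[ref] += substr($i, 6)
                if ($i ~ /^NM:i:/) edit[ref]  += substr($i, 6)
            }
        }
        END {
            for (ref in n)
                printf "%s\t%.2f\t%.2f\t%.2f\n", ref, mapq[ref] / n[ref], score[ref] / n[ref], edit[ref] / n[ref]
        }
    ' | sort
}

# Usage (assumes samtools is installed):
#   samtools view SRR1553459.ebolazaire.bam | avg_by_ref
```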

Example workflow for all NCBI RefSeq viruses

cd Virus_Detection_SRA/cwl/tools
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
gunzip viral.1.1.genomic.fna.gz
makeblastdb -dbtype nucl -in viral.1.1.genomic.fna -out viral.1.1.genomic -parse_seqids
export BLASTDB=$BLASTDB:`pwd`

Edit sidearm.SRR1553459.ebola.yml with the following changes and save it as sidearm.SRR073726.viral11genomic.yml:

  • srr: SRR073726
  • blastdb: viral.1.1.genomic
  • path: viral.1.1.genomic.fna
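
The edit can also be scripted. The sketch below is one way to do it with sed, assuming each of the three keys sits on its own "key: value" line in the YAML file (the exact layout of sidearm.SRR1553459.ebola.yml is not shown here, so check the result before running Sidearm). make_job is a hypothetical helper name.

```shell
# Hypothetical helper: copy a Sidearm job file to stdout, replacing the values
# of the srr, blastdb and path keys. Assumes each key is on its own
# "key: value" line; indentation in front of the key is preserved.
#   $1 = template YAML   $2 = SRR accession
#   $3 = BLAST database  $4 = reference FASTA path
make_job() {
    sed -E \
        -e "s|^([[:space:]]*srr:).*|\1 $2|" \
        -e "s|^([[:space:]]*blastdb:).*|\1 $3|" \
        -e "s|^([[:space:]]*path:).*|\1 $4|" \
        "$1"
}

# Usage:
#   make_job sidearm.SRR1553459.ebola.yml SRR073726 viral.1.1.genomic \
#       viral.1.1.genomic.fna > sidearm.SRR073726.viral11genomic.yml
```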

Then execute Sidearm with the command line below. Depending on your computer, this will take about 1 hour.

sidearm.cwl sidearm.SRR073726.viral11genomic.yml

Expected results

Report of alignments (summarize.tsv)

id           vname             vlen    seqcov  avgdepth  aligns  avgMAPQ  avgScore  avgEditDist
NC_032111.1  BeAn 58058 virus  163005  0.78    6.7       51,012  255      22.7      0.34
NC_001357.1  HPV18             7857    22.2    130.7     26,067  255      39.1      0.05

The longest contig should be ~600bp, although the exact result depends on the versions of the software and reference sequences you use. A BLASTN search with the longest contig sequence should show that it is Human Papillomavirus 18.

Troubleshooting

  • Please submit an issue if you run into any problems installing or running this software.

virus_detection_sra's People

Contributors

dcgenomics, pcantalupo


virus_detection_sra's Issues

sidearm crawler

Build manager to crawl through set of SRRs

  • the set of SRRs is either given manually by the user or produced by SRA search criteria that the user supplies
  • feed contigs back into a local database
  • keeps track of what has been searched and contributed

Use of uninitialized value $rlen in division (/)

I'm getting the following error with SIDEARM:

(sidearm) bash-4.2$ cwl-runner --preserve-environment BLASTDB --preserve-environment PERL5LIB /Software/sidearm/bin/cwl-runner 1.0.20170413194156
Resolved 'sidearm.cwl' to '/Software/Virus_Detection_SRA/cwl/tools/sidearm.cwl'
[job alignsrr] /tmp/tmpwrkHIE$ align_SRR_to_references.pl \
    -b \
    bacteria \
    -s \
    ERR1301508 \
    -t \
    24
Running command: magicblast -db bacteria -sra ERR1301508 -num_threads 24 | samtools view -bS - | samtools sort -o ERR1301508.bacteria.bam
[bam_sort_core] merging from 6 files...
 
real    3160m12.524s
user    55314m21.432s
sys     2314m54.336s
[job alignsrr] completed success
[step alignsrr] completed success
[job bam2seqs] /tmp/tmp_oa6d1$ bam2seqs.pl \
    -b \
    /tmp/tmp0a8Jxn/stgbd8aab1c-18d8-4229-854c-91ce1720408a/ERR1301508.bacteria.bam \
    --nopaired \
    -f \
    fasta
[job bam2seqs] completed success
[step bam2seqs] completed success
[job summarizebam] /tmp/tmpxRixsj$ summarize_bam_by_ref.pl \
    -v \
    -f \
    /tmp/tmpdk6zi5/stga83b2467-03d7-473f-975d-5b8908d1f76f/ERR1301508.bacteria.bam \
    -g \
    /tmp/tmpdk6zi5/stg0697d797-439c-4a2a-b79e-65d5de9853bc/bacteria.fasta
Use of uninitialized value $rlen in division (/) at /Software/Virus_Detection_SRA/bin/summarize_bam_by_ref.pl line 73.
Illegal division by zero at /Software/Virus_Detection_SRA/bin/summarize_bam_by_ref.pl line 73.
[job summarizebam] completed permanentFail
[step summarizebam] completed permanentFail
[workflow sidearm.cwl] outdir is /tmp/tmpTwfjF3
{   
    "report_tsv": {
        "checksum": "sha1$b9e27ea41ef657c53a719eb630b045c013ab0c5a",
        "basename": "ERR1301508.bacteria.bam.summarize.tsv",
        "location": "file:///Software/Virus_Detection_SRA/cwl/tools/ERR1301508.bacteria.bam.summarize.tsv",
        "path": "/Software/Virus_Detection_SRA/cwl/tools/ERR1301508.bacteria.bam.summarize.tsv",
        "class": "File",
        "size": 66
    },
    "bamfile": {
        "checksum": "sha1$cefe33d6e2cfe77cf7145ebb8e89d5de7e58f49e",
        "basename": "ERR1301508.bacteria.bam",
        "location": "file:///Software/Virus_Detection_SRA/cwl/tools/ERR1301508.bacteria.bam",
        "path": "/Software/Virus_Detection_SRA/cwl/tools/ERR1301508.bacteria.bam",
        "class": "File",
        "size": 681364564
    }
}
Final process status is permanentFail

Error running example workflow for Ebola outbreak

I'm testing my installation of sidearm on my machine. I've installed all the dependencies and I'm trying the first example workflow on the Ebola virus outbreak. When I run the command ./sidearm.cwl sidearm.SRR1553459.ebola.yml, I get the error:

[las2@lmem06 tools]$ ./sidearm.cwl sidearm.SRR1553459.ebola.yml 
/usr/bin/env: cwl-runner --preserve-environment BLASTDB --preserve-environment PERL5LIB: Permission denied

My cwl-runner installation works, too. I would really appreciate any help in fixing this error!
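
A likely cause, judging from the error text: on Linux, everything after the interpreter path on a `#!` line is passed to the interpreter as a single argument, so /usr/bin/env is asked to run a program literally named "cwl-runner --preserve-environment BLASTDB --preserve-environment PERL5LIB". A possible workaround is to call cwl-runner explicitly instead of executing sidearm.cwl. The sketch below reuses the flags from the error message; run_sidearm and the CWL_RUNNER override are hypothetical names, and this has not been confirmed as the fix by the Sidearm authors.

```shell
# Hypothetical wrapper: invoke cwl-runner directly so that each option reaches
# it as a separate argument. CWL_RUNNER is an assumed, optional override for
# the runner executable and defaults to cwl-runner on PATH.
run_sidearm() {
    "${CWL_RUNNER:-cwl-runner}" \
        --preserve-environment BLASTDB \
        --preserve-environment PERL5LIB \
        sidearm.cwl "$1"
}

# Usage (from Virus_Detection_SRA/cwl/tools):
#   run_sidearm sidearm.SRR1553459.ebola.yml
```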

web based UI

CWL-based management console

  • Graphical user interface

Parameters

For viral identification, parameters need to be tested and will likely be stricter than the defaults.

For novel virus prediction, parameters need to be tested and will likely be looser than or the same as the defaults.

  • Can likely delete exact matches from identification runs.

adapter trimming module feature request

  • Add support for known adapters for various platforms
    • e.g. Nextera
  • Automatic adapter prediction using Tagcleaner
    • user might want auto prediction in addition to providing their own adapters

Meaning of columns in .tsv files

I was just wondering what the precise definitions of the columns avgMAPQ, avgScore and avgEditDist are in the .tsv files. Thanks!
