tseemann / snippy Goto Github PK

View Code? Open in Web Editor NEW

437.0 29.0 111.0 127.39 MB

:scissors: :zap: Rapid haploid variant calling and core genome alignment

License: GNU General Public License v2.0

Shell 0.56% Perl 98.84% Python 0.27% Makefile 0.34%

variant-calling haploid bacteria fastq-analysis snps vcf indel-discovery bioinformatics genomics

snippy's Introduction

Snippy

Rapid haploid variant calling and core genome alignment

Author

Torsten Seemann

Synopsis

Snippy finds SNPs between a haploid reference genome and your NGS sequence reads. It will find both substitutions (snps) and insertions/deletions (indels). It will use as many CPUs as you can give it on a single computer (tested to 64 cores). It is designed with speed in mind, and produces a consistent set of output files in a single folder. It can then take a set of Snippy results using the same reference and generate a core SNP alignment (and ultimately a phylogenomic tree).

Quick Start

% snippy --cpus 16 --outdir mysnps --ref Listeria.gbk --R1 FDA_R1.fastq.gz --R2 FDA_R2.fastq.gz
<cut>
Walltime used: 3 min, 42 sec
Results folder: mysnps
Done.

% ls mysnps
snps.vcf snps.bed snps.gff snps.csv snps.tab snps.html 
snps.bam snps.txt reference/ ...

% head -5 mysnps/snps.tab
CHROM  POS     TYPE    REF   ALT    EVIDENCE        FTYPE STRAND NT_POS AA_POS LOCUS_TAG GENE PRODUCT EFFECT
chr      5958  snp     A     G      G:44 A:0        CDS   +      41/600 13/200 ECO_0001  dnaA replication protein DnaA missense_variant c.548A>C p.Lys183Thr
chr     35524  snp     G     T      T:73 G:1 C:1    tRNA  -   
chr     45722  ins     ATT   ATTT   ATTT:43 ATT:1   CDS   -                    ECO_0045  gyrA DNA gyrase
chr    100541  del     CAAA  CAA    CAA:38 CAAA:1   CDS   +                    ECO_0179      hypothetical protein
plas      619  complex GATC  AATA   GATC:28 AATA:0  
plas     3221  mnp     GA    CT     CT:39 CT:0      CDS   +                    ECO_p012  rep  hypothetical protein

% snippy-core --prefix core mysnps1 mysnps2 mysnps3 mysnps4 
Loaded 4 SNP tables.
Found 2814 core SNPs from 96615 SNPs.

% ls core.*
core.aln core.tab core.tab core.txt core.vcf

Installation

Conda

Install Bioconda then:

conda install -c conda-forge -c bioconda -c defaults snippy

Homebrew

Install Homebrew (MacOS) or LinuxBrew (Linux) then:

brew install brewsci/bio/snippy

Source

This will install the latest version direct from Github. You'll need to add Snippy's bin directory to your $PATH.

cd $HOME
git clone https://github.com/tseemann/snippy.git
$HOME/snippy/bin/snippy --help

Check installation

Ensure you have the desired version:

snippy --version

Check that all dependencies are installed and working:

snippy --check

Calling SNPs

Input Requirements

a reference genome in FASTA or GENBANK format (can be in multiple contigs)
sequence read file(s) in FASTQ or FASTA format (can be .gz compressed) format
a folder to put the results in

Output Files

Extension	Description
.tab	A simple tab-separated summary of all the variants
.csv	A comma-separated version of the .tab file
.html	A HTML version of the .tab file
.vcf	The final annotated variants in VCF format
.bed	The variants in BED format
.gff	The variants in GFF3 format
.bam	The alignments in BAM format. Includes unmapped, multimapping reads. Excludes duplicates.
.bam.bai	Index for the .bam file
.log	A log file with the commands run and their outputs
.aligned.fa	A version of the reference but with `-` at position with `depth=0` and `N` for `0 < depth < --mincov` (does not have variants)
.consensus.fa	A version of the reference genome with all variants instantiated
.consensus.subs.fa	A version of the reference genome with only substitution variants instantiated
.raw.vcf	The unfiltered variant calls from Freebayes
.filt.vcf	The filtered variant calls from Freebayes
.vcf.gz	Compressed .vcf file via BGZIP
.vcf.gz.csi	Index for the .vcf.gz via `bcftools index`)

⚠️ ❌ Snippy 4.x does NOT produce the following files that Snippy 3.x did

Extension	Description
.vcf.gz.tbi	Index for the .vcf.gz via TABIX
.depth.gz	Output of `samtools depth -aa` for the `.bam` file
.depth.gz.tbi	Index for the `.depth.gz` file

Columns in the TAB/CSV/HTML formats

Name	Description
CHROM	The sequence the variant was found in eg. the name after the `>` in the FASTA reference
POS	Position in the sequence, counting from 1
TYPE	The variant type: snp msp ins del complex
REF	The nucleotide(s) in the reference
ALT	The alternate nucleotide(s) supported by the reads
EVIDENCE	Frequency counts for REF and ALT

If you supply a Genbank file as the --reference rather than a FASTA file, Snippy will fill in these extra columns by using the genome annotation to tell you which feature was affected by the variant:

Name	Description
FTYPE	Class of feature affected: CDS tRNA rRNA ...
STRAND	Strand the feature was on: + - .
NT_POS	Nucleotide position of the variant withinthe feature / Length in nt
AA_POS	Residue position / Length in aa (only if FTYPE is CDS)
LOCUS_TAG	The `/locus_tag` of the feature (if it existed)
GENE	The `/gene` tag of the feature (if it existed)
PRODUCT	The `/product` tag of the feature (if it existed)
EFFECT	The `snpEff` annotated consequence of this variant (ANN tag in .vcf)

Variant Types

Type	Name	Example
snp	Single Nucleotide Polymorphism	A => T
mnp	Multiple Nuclotide Polymorphism	GC => AT
ins	Insertion	ATT => AGTT
del	Deletion	ACGG => ACG
complex	Combination of snp/mnp	ATTC => GTTA

The variant caller

The variant calling is done by Freebayes. The key parameters under user control are:

--mincov - the minimum number of reads covering a site to be considered (default=10)
--minfrac - the minimum proportion of those reads which must differ from the reference
--minqual - the minimum VCF variant call "quality" (default=100)

Looking at variants in detail with `snippy-vcf_report`

If you run Snippy with the --report option it will automatically run snippy-vcf_report and generate a snps.report.txt which has a section like this for each SNP in snps.vcf:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>LBB_contig000001:10332 snp A=>T DP=7 Q=66.3052 [7]

         10301     10311     10321     10331     10341     10351     10361
tcttctccgagaagggaatataatttaaaaaaattcttaaataattcccttccctcccgttataaaaattcttcgcttat
........................................T.......................................
,,,,,,  ,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,t,,t,,,,,,,,,,,,,,,,g,,,,,,,g,,,,,,,,,t,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, .......T..................A............A.......
.........................A........A.....T...........    .........C..............
.....A.....................C..C........CT.................TA.............
,a,,,,,a,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,t,t,,,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,ga,,,,,,,c,,,,,,,t,,,,,,,,,,g,,,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
                            ............T.C..............G...............G......
                                                    ,,,,,,,g,,,,,,,,g,,,,,,,,,,,
                                                           g,,,,,,,,,,,,,,,,,,,,

If you wish to generate this report after you have run Snippy, you can run it directly:

cd snippydir
snippy-vcf_report --cpus 8 --auto > snps.report.txt

If you want a HTML version for viewing in a web browser, use the --html option:

cd snippydir
snippy-vcf_report --html --cpus 16 --auto > snps.report.html

It works by running samtools tview for each variant, which can be very slow if you have 1000s of variants. Using --cpus as high as possible is recommended.

Options

--rgid will set the Read Group (RG) ID (ID) and Sample (SM) in the BAM and VCF file. If not supplied, it will will use the --outdir folder name for both ID and SM.
--mapqual is the minimum mapping quality to accept in variant calling. BWA MEM using 60 to mean a read is "uniquely mapped".
--basequal is minimum quality a nucleotide needs to be used in variant calling. We use 13 which corresponds to error probability of ~5%. It is a traditional SAMtools value.
--maxsoft is how many bases of an alignment to allow to be soft-clipped before discarding the alignment. This is to encourage global over local alignment, and is passed to the samclip tool.
--mincov and --minfrac are used to apply hard thresholds to the variant calling beyond the existing statistical measure.. The optimal values depend on your sequencing depth and contamination rate. Values of 10 and 0.9 are commonly used.
--targets takes a BED file and only calls variants in those regions. Not normally needed unless you are only interested in variants in specific locii (eg. AMR genes) but are still performing WGS rather than amplicon sequencing.
--contigs allows you to call SNPs from contigs rather than reads. It shreds the contigs into synthetic reads, as to put the calls on even footing with other read samples in a multi-sample analysis.

Core SNP phylogeny

If you call SNPs for multiple isolates from the same reference, you can produce an alignment of "core SNPs" which can be used to build a high-resolution phylogeny (ignoring possible recombination). A "core site" is a genomic position that is present in all the samples. A core site can have the same nucleotide in every sample ("monomorphic") or some samples can be different ("polymorphic" or "variant"). If we ignore the complications of "ins", "del" variant types, and just use variant sites, these are the "core SNP genome".

Input Requirements

a set of Snippy folders which used the same --ref sequence.

Using `snippy-multi`

To simplify running a set of isolate sequences (reads or contigs) against the same reference, you can use the snippy-multi script. This script requires a tab separated input file as follows, and can handle paired-end reads, single-end reads, and assembled contigs.

# input.tab = ID R1 [R2]
Isolate1	/path/to/R1.fq.gz	/path/to/R2.fq.gz
Isolate1b	/path/to/R1.fastq.gz	/path/to/R2.fastq.gz
Isolate1c	/path/to/R1.fa		/path/to/R2.fa
# single end reads supported too
Isolate2	/path/to/SE.fq.gz
Isolate2b	/path/to/iontorrent.fastq
# or already assembled contigs if you don't have reads
Isolate3	/path/to/contigs.fa
Isolate3b	/path/to/reference.fna.gz

Then one would run this to generate the output script. The first parameter should be the input.tab file. The remaining parameters should be any remaining shared snippy parameters. The ID will be used for each isolate's --outdir.

% snippy-multi input.tab --ref Reference.gbk --cpus 16 > runme.sh

% less runme.sh   # check the script makes sense

% sh ./runme.sh   # leave it running over lunch

It will also run snippy-core at the end to generate the core genome SNP alignment files core.*.

Output Files

Extension	Description
.aln	A core SNP alignment in the `--aformat` format (default FASTA)
.full.aln	A whole genome SNP alignment (includes invariant sites)
.tab	Tab-separated columnar list of core SNP sites with alleles but NO annotations
.vcf	Multi-sample VCF file with genotype `GT` tags for all discovered alleles
.txt	Tab-separated columnar list of alignment/core-size statistics
.ref.fa	FASTA version/copy of the `--ref`
.self_mask.bed	BED file generated if `--mask auto` is used.

Why is `core.full.aln` an alphabet soup?

The core.full.aln file is a FASTA formatted mutliple sequence alignment file. It has one sequence for the reference, and one for each sample participating in the core genome calculation. Each sequence has the same length as the reference sequence.

Character	Meaning
`ATGC`	Same as the reference
`atgc`	Different from the reference
`-`	Zero coverage in this sample or a deletion relative to the reference
`N`	Low coverage in this sample (based on `--mincov`)
`X`	Masked region of reference (from `--mask`)
`n`	Heterozygous or poor quality genotype (has `GT=0/1` or `QUAL < --minqual` in `snps.raw.vcf`)

You can remove all the "weird" characters and replace them with N using the included snippy-clean_full_aln. This is useful when you need to pass it to a tree-building or recombination-removal tool:

% snippy-clean_full_aln core.full.aln > clean.full.aln
% run_gubbins.py -p gubbins clean.full.aln
% snp-sites -c gubbins.filtered_polymorphic_sites.fasta > clean.core.aln
% FastTree -gtr -nt clean.core.aln > clean.core.tree

Options

If you want to mask certain regions of the genome, you can provide a BED file with the --mask parameter. Any SNPs in those regions will be excluded. This is common for genomes like M.tuberculosis where pesky repetitive PE/PPE/PGRS genes cause false positives, or masking phage regions. A --mask bed file for M.tb is provided with Snippy in the etc/Mtb_NC_000962.3_mask.bed folder. It is derived from the XLSX file from https://gph.niid.go.jp/tgs-tb/
If you use the snippy --cleanup option the reference files will be deleted. This means snippy-core can not "auto-find" the reference. In this case you simply use snippy-core --reference REF to provide the reference in FASTA format.

Advanced usage

Increasing speed when too many reads

Sometimes you will have far more sequencing depth that you need to call SNPs. A common problem is a whole MiSeq flowcell for a single bacterial isolate, where 25 million reads results in genome depth as high as 2000x. This makes Snippy far slower than it needs to be, as most SNPs will be recovered with 50-100x depth. If you know you have 10 times as much data as you need, Snippy can randomly sub-sample your FASTQ data:

# have 1000x depth, only need 100x so sample at 10%
snippy --subsample 0.1  ...
<snip>
Sub-sampling reads at rate 0.1
<snip>

Only calling SNPs in particular regions

If you are looking for specific SNPs, say AMR releated ones in particular genes in your reference genome, you can save much time by only calling variants there. Just put the regions of interest into a BED file:

snippy --targets sites.bed ...

Finding SNPs between contigs

Sometimes one of your samples is only available as contigs, without corresponding FASTQ reads. You can still use these contigs with Snippy to find variants against a reference. It does this by shredding the contigs into 250 bp single-end reads at 2 × --mincov uniform coverage.

To use this feature, instead of providing --R1 and --R2 you use the --ctgs option with the contigs file:

% ls
ref.gbk mutant.fasta

% snippy --outdir mut1 --ref ref.gbk --ctgs mut1.fasta
Shredding mut1.fasta into pseudo-reads.
Identified 257 variants.

% snippy --outdir mut2 --ref ref.gbk --ctgs mut2.fasta
Shredding mut2.fasta into pseudo-reads.
Identified 413 variants.

% snippy-core mut1 mut2 
Found 129 core SNPs from 541 variant sites.

% ls
core.aln core.full.aln ...

This output folder is completely compatible with snippy-core so you can mix FASTQ and contig based snippy output folders to produce alignments.

Correcting assembly errors

The de novo assembly process attempts to reconstruct the reads into the original DNA sequences they were derived from. These reconstructed sequences are called contigs or scaffolds. For various reasons, small errors can be introduced into the assembled contigs which are not supported by the original reads used in the assembly process.

A common strategy is to align the reads back to the contigs to check for discrepancies. These errors appear as variants (SNPs and indels). If we can reverse these variants than we can "correct" the contigs to match the evidence provided by the original reads. Obviously this strategy can go wrong if one is not careful about how the read alignment is performed and which variants are accepted.

Snippy is able to help with this contig correction process. In fact, it produces a snps.consensus.fa FASTA file which is the ref.fa input file provided but with the discovered variants in snps.vcf applied!

However, Snippy is not perfect and sometimes finds questionable variants. Typically you would make a copy of snps.vcf (let's call it corrections.vcf) and remove those lines corresponding to variants we don't trust. For example, when correcting Roche 454 and PacBio SMRT contigs, we primarily expect to find homopolymer errors and hence expect to see ins more than snp type variants.

In this case you need to run the correcting process manually using these steps:

% cd snippy-outdir
% cp snps.vcf corrections.vcf
% $EDITOR corrections.vcf
% bgzip -c corrections.vcf > corrections.vcf.gz
% tabix -p vcf corrections.vcf.gz
% vcf-consensus corrections.vcf.gz < ref.fa > corrected.fa

You may wish to iterate this process by using corrected.fa as a new --ref for a repeated run of Snippy. Sometimes correcting one error allows BWA to align things it couldn't before, and new errors are uncovered.

Snippy may not be the best way to correct assemblies - you should consider dedicated tools such as PILON or iCorn2, or adjust the Quiver parameters (for Pacbio data).

Unmapped Reads

Sometimes you are interested in the reads which did not align to the reference genome. These reads represent DNA that was novel to your sample which is potentially interesting. A standard strategy is to de novo assemble the unmapped reads to discover these novel DNA elements, which often comprise mobile genetic elements such as plasmids.

By default, Snippy does not keep the unmapped reads, not even in the BAM file. If you wish to keep them, use the --unmapped option and the unaligned reads will be saved to a compressed FASTQ file:

% snippy --outdir out --unmapped ....

% ls out/
snps.unmapped.fastq.gz ....

Information

Etymology

The name Snippy is a combination of SNP (pronounced "snip") , snappy (meaning "quick") and Skippy the Bush Kangaroo (to represent its Australian origin)

License

Snippy is free software, released under the GPL (version 2).

Issues

Please submit suggestions and bug reports to the Issue Tracker

Requirements

perl >= 5.18
bioperl >= 1.7
bwa mem >= 0.7.12
minimap2 >= 2.0
samtools >= 1.7
bcftools >= 1.7
bedtools >= 2.0
GNU parallel >= 2013xxxx
freebayes >= 1.1 (freebayes, freebayes-parallel, fasta_generate_regions.py)
vcflib >= 1.0 (vcfstreamsort, vcfuniq, vcffirstheader)
vt >= 0.5
snpEff >= 4.3
samclip >= 0.2
seqtk >= 1.2
snp-sites >= 2.0
any2fasta >= 0.4
wgsim >= 1.8 (for testing only - wgsim command)

Bundled binaries

For Linux (compiled on Ubuntu 16.04 LTS) and macOS (compiled on High Sierra Brew) some of the binaries, JARs and scripts are included.

snippy's People

Contributors

Stargazers

Watchers

Forkers

autoinducer fw1121 roychaudhuri slugger70 tomrconnor aavilaherrera jnarvaezp blawlor nickloman lemetrec mnloff gmboowa tankmermaid rjbeng ohlab rpucheq bawee bioinfoacademy m-bull druvus pythseq sharbie88 fredrick-kakembo frahiriart yemilawal locusliu wangdi2014 barrantesisrael ypchan saramonzon robsyme maurijlozano tiredbird buihoangphuc412 jdeligt ja-lacey taylahj1 tauqeer9 yob1997 felipelira catrionapharkins superkits peterk87 pxhhappy donutbrew fmaguire ruru-adra fredyr animesh skerker labultet 1928d wbyu jianshu93 vishnuraghuram94 coolmaksat gunzivan28 yanyu-pan prmac ngocdung1412 ktmeaton standardgalactic maolingfengzju tw7649116 ssyamoako biogeeker mingjuhao carlosavelezgyahoo-com tanusaroha pryp88 pang-hd deensm esteinig yananzh wook2014 raphaelhablesreiter yh18862107529 chandrima-04 alipirani88 itxasarko zhichusun 186325 qhgenomics alanseb92 urzia zhangxiaodong8315 tangguo-guo chcrestani odetteg kaunath hsnguyen telatin nanyw123 vszhang1976 chris091089 stanikae jorgebotas hyzhou1990 gaworj absartalat

snippy's Issues

Errors when SNPs hidden in 'complex' variants

Because we are not joint-calling, we can get variant sites from different isolates which have different SNP types. eg. position 344421 could be 'snp' in isolate1, 'mnp' in 2, and 'complex' in 3 !

We are ignoring 'complex' types. Need to take them into consideration so we don't get false matches to reference!

SVG::Graph still required

SVG::GRaph is erroneously required during install via brew.

Freebayes over chunking when reference short

Freebayes will process 256 chunks of 3 bp, 64 chunks at a time.

Whoops.

snippy-core should produce a multi-sample VCF

Combine all the individual ones.

Add --fast option

Add a filter to only align to the first N reads such that we get approximately 2 * $mincov coverage of the reference genome.

This will allow most SNPs to be called but greatly speed up both bwa, samtools and freebayes.

Parallelize the --report component

As advertised, --report is very slow for lots of SNPs. Use GNU parallel to run in batches.

Add a snps.txt report summarizing things

eg.

Ref xxx.gbk
Reads R1.fq R2.fq
SNP 34
MNP 2
INS 4
DEL 7
ReadsAligned 454355
PercentAligned 86.2%

Add alignment statistics to .txt output

From samtools flagstat or samtools stats

Duplicate ref.fa IDs fails downstream

Not my fault, samtools faidx doesn't even check. But Freebayes fails. We should check it.

Allow --ref to be a GFF3 file, or add --gff option

Consider setting RG ID to --dir rather than --prefix

Current version not correctly honouring --minfrac

Although freebayes -F $minfrac is being used, Erik says that is only a guide to trigger a variant call. For a single sample like ours, it probably works, but not guaranteed?

snippy install on OS X (chapmanb/cbl/vcflib 2015-03-13 did not build)

See downstream report here chapmanb/homebrew-cbl#22.

Please provide static executables

uqmstan1@barrine2:~/Beatson_shared/bin/snippy/1.4/binaries/linux> ./bcftools
./bcftools: /lib64/libz.so.1: no version information available (required by ./bcftools)
./bcftools: /lib64/libc.so.6: version GLIBC_2.14' not found (required by ./bcftools) ./bcftools: /lib64/libc.so.6: versionGLIBC_2.15' not found (required by ./bcftools)

uqmstan1@barrine2:~/Beatson_shared/bin/snippy/1.4/binaries/linux> ./bwa
./bwa: /lib64/libc.so.6: version GLIBC_2.14' not found (required by ./bwa) ./bwa: /lib64/libc.so.6: versionGLIBC_2.15' not found (required by ./bwa)

./freebayes: /lib64/libc.so.6: version GLIBC_2.15' not found (required by ./freebayes) ./freebayes: /lib64/libc.so.6: versionGLIBC_2.14' not found (required by ./freebayes)
./freebayes: /sw/gcc/4.5.2/lib64/gcc/x86_64-suse-linux/4.5.2/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by ./freebayes)

uqmstan1@barrine2:~/Beatson_shared/bin/snippy/1.4/binaries/linux> ./samtools
./samtools: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: No such file or directory

uqmstan1@barrine2:~/Beatson_shared/bin/snippy/1.4/binaries/linux> ./vcfstreamsort
./vcfstreamsort: /lib64/libc.so.6: version GLIBC_2.15' not found (required by ./vcfstreamsort) ./vcfstreamsort: /lib64/libc.so.6: versionGLIBC_2.14' not found (required by ./vcfstreamsort)
./vcfstreamsort: /sw/gcc/4.5.2/lib64/gcc/x86_64-suse-linux/4.5.2/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by ./vcfstreamsort)

uqmstan1@barrine2:~/Beatson_shared/bin/snippy/1.4/binaries/linux> ./vcfuniq
./vcfuniq: /lib64/libc.so.6: version GLIBC_2.15' not found (required by ./vcfuniq) ./vcfuniq: /lib64/libc.so.6: versionGLIBC_2.14' not found (required by ./vcfuniq)
./vcfuniq: /sw/gcc/4.5.2/lib64/gcc/x86_64-suse-linux/4.5.2/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by ./vcfuniq)

BAM file as input

Hi!

Any plans on including BAM format as input to snippy..? It would be convenient a convenient option :)

Best,
Shaman

allow use of targets file (freebayes -t)

It's useful to restrict the analysis to a list of targets in bed format (freebayes -t flag). I find this more efficient than filtering the vcf by position with tabix, vcftools etc afterwards.

Is it possible to have *.aligned.fa with variants?

Hi Torst.

Would it be possible for .aligned.fa to also include variants (SNPs)? .consensus.fa includes all variants (including indels --- meaning it has a different length from the reference) making it hard to cut using coordinates from the reference.

My idea would be to produce a snps.vcf.gz with SNPs only, and then apply vcf-concensus. Would that work?

Anders.

Update docs for --unmapped

Show amino acid / codon effect when feature is a CDS

Include in FEATURES column in snps.tab and snps.csv

snps.aligned.fa has ref ID rather than Isolate in FASTA header

Output of core genome size for subset of samples

Hi Torsten

Could snippy/snippy-core please output the number of core genome sites/percentage core shared amongst a group of isolates. Could possibly be added to the core.txt file where all the individual % aligned to reference data is?

Thanks,
Bainesy

Put all the ref.* file in a subdirectory of --outdir for neatness

Error when using GBK file

Thanks for creating this pipeline - it is very useful. I'm able to get everything to work if I use just a multi-fasta file for a reference, however, I'm getting an error during the snpEff build step if I use a GBK file. I've tried both an NCBI GBK and one I created manually - both give me the same error

Here's the error from the log file:

### snpEff build -c reference/snpeff.config -dataDir . -gff3 ref

java.lang.RuntimeException: Error reading file '/Users/ngs/Desktop/Anid_snps/MRB326/reference/./ref/genes.gff'
java.lang.RuntimeException: FATAL ERROR: Offending line (lineNum: 15): 'Chr1    snippy  CDS 1976    1986    .   +   0   ID=AN10814;codon_start=1;locus_tag=AN10814;product=Has domain(s) with predicted oxidoreductase activity and role in metabolic process;protein_id=ncbi:rna_AN10814-1;translation=MRSESSTDVLIVGAGPAGLTTALWLAHTGVQFRIIDKRPNIPRRGQADGLSPRTMEILETFEVAHEVTRLWERATDEMLWCRDAQGNLTRMERFRNQPPQGVRWGHGTLQQGVVEEIMKKKITEVCGVEVEYETTLFELSLDTTKANDPEAFPWSATVRYGTDEPAGQMLSKTMLAKYVVGADGGRSFVRQTMGIEMQGTKGEAVWGVMDIIGTSDFPDDIDGAVDFVRREKDRDAITPESIIVKCEYIIRPYKLYIKEHVWWSAFTVAQRISNSMAVHNRGFLVGDAVHTHSPLCGAGMNTAVQLHRDAYNLGWKLAGVLKHQLNPEILQTYGAERRPVAEALLDADKTILALFHAPLGPEAEALLAKADHIQAYLSGRGIQYRASLLTHGSAEGLRSLAVLPGHCIPDITVQNYMTGRASNLHSWIMADGGWSVIFWASDLSCASRVENIHNCCKQVESIRAKTSSKVGYMLDAFLIHCNEWPSVDLAALPGLFLPPSKYGCLDYGKIFVREETASCKSERMNNLGGIAVVRPDKYVGWVGGLDDMVGLGLYFSKIFL'. File '/Users/ngs/Desktop/Anid_snps/MRB326/reference/./ref/genes.gff' line 15
    'Chr1   snippy  CDS 1976    1986    .   +   0   ID=AN10814;codon_start=1;locus_tag=AN10814;product=Has domain(s) with predicted oxidoreductase activity and role in metabolic process;protein_id=ncbi:rna_AN10814-1;translation=MRSESSTDVLIVGAGPAGLTTALWLAHTGVQFRIIDKRPNIPRRGQADGLSPRTMEILETFEVAHEVTRLWERATDEMLWCRDAQGNLTRMERFRNQPPQGVRWGHGTLQQGVVEEIMKKKITEVCGVEVEYETTLFELSLDTTKANDPEAFPWSATVRYGTDEPAGQMLSKTMLAKYVVGADGGRSFVRQTMGIEMQGTKGEAVWGVMDIIGTSDFPDDIDGAVDFVRREKDRDAITPESIIVKCEYIIRPYKLYIKEHVWWSAFTVAQRISNSMAVHNRGFLVGDAVHTHSPLCGAGMNTAVQLHRDAYNLGWKLAGVLKHQLNPEILQTYGAERRPVAEALLDADKTILALFHAPLGPEAEALLAKADHIQAYLSGRGIQYRASLLTHGSAEGLRSLAVLPGHCIPDITVQNYMTGRASNLHSWIMADGGWSVIFWASDLSCASRVENIHNCCKQVESIRAKTSSKVGYMLDAFLIHCNEWPSVDLAALPGLFLPPSKYGCLDYGKIFVREETASCKSERMNNLGGIAVVRPDKYVGWVGGLDDMVGLGLYFSKIFL'

    at ca.mcgill.mcb.pcingola.snpEffect.factory.SnpEffPredictorFactoryGff.create(SnpEffPredictorFactoryGff.java:132)
    at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdBuild.createSnpEffPredictor(SnpEffCmdBuild.java:115)
    at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:304)
    at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.run(SnpEff.java:883)
    at ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff.main(SnpEff.java:128)

Genome chunking wrong if --ref .gbk

Uses file size as proxy for genome size.
This does not work for .gbk --ref files!

No docs for --ctgs option

fails with samtools 1.3

Looks like something changed between 1.1 and 1.3 regarding the -T -o options to sort. See the log file below.

### samtools faidx reference/ref.fa


### bwa index reference/ref.fa

[bwa_index] Pack FASTA... 0.02 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.84 seconds elapse.
[bwa_index] Update BWT... 0.02 sec
[bwa_index] Pack forward-only FASTA... 0.03 sec
[bwa_index] Construct SA from BWT and Occ... 0.45 sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa index reference/ref.fa
[main] Real time: 1.653 sec; CPU: 1.369 sec

### mkdir reference/genomes && cp -f reference/ref.fa reference/genomes/ref.fa


### mkdir reference/ref && bgzip -c reference/ref.gff > reference/ref/genes.gff.gz


### (bwa mem  -v 2 -M -R '@RG   ID:snps SM:snps' -t 16 reference/ref.fa /nv/vol46/uvabx/projects/houpt/samara-mtb/snps/ERR027468_1.fastq.gz /nv/vol46/uvabx/projects/houpt/samara-mtb/snps/ERR027468_2.fastq.gz | samtools view -@ 16 -F 3844 -q 60 -S -b -u -T ref.fa - | samtools sort -@ 16 - snps)

[bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
  -@, --threads INT
             Set number of sorting and compression threads [1]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
[samfaipath] fail to read file ref.fa.
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (314, 741, 1299)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3269)
[M::mem_pestat] mean and std.dev: (885.94, 718.35)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 4254)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (142, 172, 213)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 355)
[M::mem_pestat] mean and std.dev: (179.11, 54.24)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 426)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (622, 1410, 3220)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 8416)
[M::mem_pestat] mean and std.dev: (2037.37, 1944.74)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 11014)
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (386, 721, 1296)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 3116)
[M::mem_pestat] mean and std.dev: (864.82, 652.77)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 4026)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_pestat] skip orientation RR

Exception: Unable to read reference sequence base past end of current cached sequence.

In one of my projects with 100's of samples, the total size of the snippy output directories was > 20X larger than expected. I realised that the log files are each filled with ~25000-160000 instances of:

Exception: Unable to read reference sequence base past end of current cached sequence.

e.g.

Exception: Unable to read reference sequence base past end of current cached sequence.
jla513_unpol:36
35-336
alignment: AATTTCATAACATCAACTAGGGTTTGGTCCGAAGCATACGTGTTTACAATGTTCGAACAACTTATACAGTTCTTATACATACTTTATAAATTATTTCCCAAACTGTTTTGATACACTCACTAACAGATACTCTATAGAAGGAAAAGTTATCCACTTATGCACATTTATAGTTTTCAGAATTGTGGATAATTAGAAATTACACACAAAGTTATACTATTTTTAGCAACATATTCACAGGTATTTGACATATAGAGAACTGAAAAAGTATAATTGTGTGGATAAGTCGTCCAACTCATGATTT
currentSequence: CGATTAAAGATAGAAATACACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTTGGTCCGAAGCATGAGTGTTTACAATGTTCGAACACCTTATACAGTTCTTATACATACTTTATAAATTATTTCCCAAACTGTTTTGATACACTCACTAACAGATACTCTATAGAAGGAAAAGTTATCCACTTATGCACATTTATAGTTTTCAGAATTGTGGATAATTAGAAATTACACACAAAGTTATACTATTTTTAGCAACATATTCACAGGTATTTGACATATAGAGAACTGAAAAAGTATAATTGTGTGGATAAGTCGTCCAACTCATGATT
currentSequence matching:

This seems to be a known freebayes issue freebayes/freebayes#165 and installing v1.0.1 seems to have resolved it for me. The number of SNPs increases with v.1.0.1. Don't know if this is related to the exception or to some other change in freebayes behaviour.

The .csv file is really .tsv

Perhaps write a .tab file, then also write a true .csv from that using Text::CSV ?

Also write a simple snps.html report

Might be useful for deploying to websites, in conjunction with the --report in HTML format.

snippy-core error

Hi Torsten, having issues with snippy-core as follows:
$snippy-core --prefix core mysnps1 mysnps2 mysnps3
"any" is not exported by the List::Util module Can't continue after import errors at /mydir/snippy/bin/snippy-core line 5. BEGIN failed--compilation aborted at /mydir/snippy/bin/snippy-core line 5.

Perhaps this is a usage/install error but would appreciate help.

Allow SNP finding between reference and a contigs file

Incorrect numbering/reference nucleotide

The nucleotide listed as the reference in snps.tab file is not always (or perhaps ever) the correct one for the nucleotide position and codon position listed.

Missing 13bp deletion due to freebayes -F 0.9 :-/

NRS384_contig000001 2460777 . CAGTAGGATACATTG CG 4140.99 . AB=0;ABP=0;AC=1;AF=1;AN=1;AO=140;CIGAR=1M13D1M;DP=140;DPB=21;DPRA=0;EPP=9.21451;EPPR=0;GTI=0;LEN=13;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=953.498;PAIRED=1;PAIREDR=0;PAO=17.5;PQA=577.5;PQR=514.5;PRO=15.5;QA=4536;QR=0;RO=0;RPL=86;RPP=18.8931;RPPR=0;RPR=54;RUN=1;SAF=54;SAP=18.8931;SAR=86;SRF=0;SRP=0;SRR=0;TYPE=del GT:DP:RO:QR:AO:QA:GL 1:140:0:0:140:4536:-414.099,0

snpEff results are wierd sometimes

The SNP and AA positions are correct but the amino acids are wrong.

core.tab headers off

Currently core.tab has SNP effect (e.g., "missense_variant c.741C>A p.Asp247Glu") under the LOCUS_TAG heading.

Anders.

no freebayes-parallel

We are having a look at snippy.

With exception of perl, bioperl & awk all other binaries appear to be bundled.

uqmstan1@barrine2:~/Beatson_shared/bin/snippy/1.4/test> make
../bin/snippy --outdir test1 --force --ref example.fna reads_R1.fastq.gz reads_R2.fastq.gz
[09:37:26] This is snippy 1.4
[09:37:26] Written by Torsten Seemann [email protected]
[09:37:26] Obtained from http://www.vicbioinformatics.com
[09:37:26] Detected operating system: linux
[09:37:26] Enabling bundled tools in: /work2/UQ/Science/SCMB/Beatson/bin/snippy/1.4/bin/binaries/{linux,noarch}
[09:37:26] Found cp - ok.
[09:37:26] Found rm - ok.
[09:37:26] Found awk - ok.
[09:37:26] Found renice - ok.
[09:37:26] Found bwa - ok.
[09:37:26] Found samtools - ok.
[09:37:26] Found vcfutils.pl - ok.
which: no freebayes-parallel in

[09:37:26] Can not find 'freebayes-parallel' in PATH
make: *** [test1] Error 2

uqmstan1@barrine2:~/Beatson_shared/bin/snippy/1.4/binaries> ls *
darwin:
bwa samtools

linux:
bcftools bwa freebayes samtools vcfstreamsort vcfuniq

noarch:
fasta_generate_regions.py freebayes-parallel parallel vcfutils.pl

Show full path to discovered binaries

So we can tell which binary actually got used when there could be more than one.

BGzip included in snippy

Note,

binary version of BGZIP included with source code is dependent on glibc, with the static compile that is included compiled against glibc 2.14. Had to recompile from samtools/htseq from my own install of samtools and replace binary.

Show output files 1 per line at end of log

snippy-core too slow with lots of samples

You can just cat the individual "snps.aligned.fa" files dummy!

Including 'complex' variants introduces monomorphics to nway output

Need logic to remove monomorphic columns (resulting from 'complex' types where some bases match the reference) so they don't end up in nway.tab.

I don't think they will end up in regular .tab output though.

"variant did not have ANN annotation" in EFFECT field of .tab

the "EFFECT" field of the .tab output contains mostly "variant did not have ANN annotation".

snippy v2.9
gbk reference from prokka v1.11

Only ever used snippy with prokka gbk files and never had "variant did not have ANN annotation" before. Tried reannotating with prokka but issue presists.

snippy-core

When entering folder names in command-line for snippy-core, the trailing '/' causes an error with generating the ID, particularly if there are multiple folders with the trailing slash. This allocates multiple samples to have an ID = ' '.

MSG: Got a sequence without letters. Could not guess alphabet

[12:30:16] Generating aligned/masked FASTA relative to reference: snps.aligned.fa

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------
[12:30:20] Creating extra output files: BED GFF CSV TXT HTML

vcf-consensus option

It would be useful to have a vcf-consensus option built in:
$ bgzip snps.vcf
$ tabix -p vcf snps.vcf.gz
$ vcf-consensus snps.vcf.gz < ref.fa > new.fa

Snippy help for analyzing output

Hi Seemann,

Greetings for the day!!!

I have just started using the snippy tool for finding snp's from a bacterial genome and i found it really awesome and very fast detection and annotation tool.

I have a few questions:

I have used a draft genome as a reference. How confident can i be of the snp's called and annotated?
Also this genome has been sequenced on Iontorrent platform. So they are single end reads. Is the tool suitable for this platform.?
There is a file called .depth.gz. is it the file that indicated the depth of coverage of snp's called? IS there any other file from which i can take the depth parameters to take into consideration the real snp's?

Thanks in advance,
Ramya.

tseemann / snippy Goto Github PK

snippy's Introduction

Snippy

Author

Synopsis

Quick Start

Installation

Conda

Homebrew

Source

Check installation

Calling SNPs

Input Requirements

Output Files

Columns in the TAB/CSV/HTML formats

Variant Types

The variant caller

Looking at variants in detail with snippy-vcf_report

Options

Core SNP phylogeny

Input Requirements

Using snippy-multi

Output Files

Why is core.full.aln an alphabet soup?

Options

Advanced usage

Increasing speed when too many reads

Only calling SNPs in particular regions

Finding SNPs between contigs

Correcting assembly errors

Unmapped Reads

Information

Etymology

License

Issues

Requirements

Bundled binaries

snippy's People

Contributors

Stargazers

Watchers

Forkers

snippy's Issues

Recommend Projects

Recommend Topics

Recommend Org

Looking at variants in detail with `snippy-vcf_report`

Using `snippy-multi`

Why is `core.full.aln` an alphabet soup?