sanger-pathogens / snp-sites

Finds SNP sites from a multi-FASTA alignment file

Home Page: http://sanger-pathogens.github.io/snp-sites/

License: Other

C 87.36% Perl 2.99% Makefile 0.96% M4 6.74% Dockerfile 0.27% C++ 1.68%

genomics sequencing next-generation-sequencing research bioinformatics bioinformatics-pipeline global-health infectious-diseases pathogen

snp-sites's Introduction

SNP-sites

Rapidly extracts SNPs from a multi-FASTA alignment.

Introduction
Installation
Usage
License
Feedback/Issues
Citation

Introduction

Rapidly decreasing genome sequencing costs have led to a proportionate increase in the number of samples used in prokaryotic population studies. Extracting single nucleotide polymorphisms (SNPs) from a large whole genome alignment is now a routine task, but existing tools have failed to scale efficiently with the increased size of studies. These tools are slow, memory inefficient and are installed through non-standard procedures. We present SNP-sites which can rapidly extract SNPs from a multi-FASTA alignment using modest resources and can output results in multiple formats for downstream analysis. SNPs can be extracted from a 8.3 GB alignment file (1,842 taxa, 22,618 sites) in 267 seconds using 59 MB of RAM and 1 CPU core, making it feasible to run on modest computers. It is easy to install through the Debian and Homebrew package managers, and has been successfully tested on more than 20 operating systems. SNP-sites is implemented in C and is available under the open source license GNU GPL version 3.

Installation

There are a few ways to install SNP-sites. The simplest way is using apt (Debian/Ubuntu) or Conda. If you encounter an issue when installing SNP-sites, please contact your local system administrator. If you encounter a bug, please log it here.

Linux - Ubuntu/Debian
OSX/Linux - using Bioconda
OSX/Linux - from source
OSX/Linux - from a release tarball
All platforms - Docker

Linux - Ubuntu/Debian

If you have a recent version of Ubuntu or Debian then you can install it using apt.

   apt-get install snp-sites

OSX/Linux - using Bioconda

Install Conda and install the bioconda channels.

conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda
conda install snp-sites

OSX/Linux - from source

This is a difficult method and is only suitable for someone with advanced unix skills. No support is provided with this method, since you have advanced unix skills. Please consider using Conda instead. First install a standard development environment (e.g. gcc, automake, autoconf, libtool). Download the software from GitHub.

autoreconf -i -f
./configure
make
sudo make install

OSX/Linux - from a release tarball

tar xzvf snp-sites-x.y.z.tar.gz
cd snp-sites-x.y.z
./configure
make
sudo make install

All platforms - Docker

Bioconda produce a Docker container so you can use the software out of the box. Install Docker and then pull the container from Bioconda https://quay.io/repository/biocontainers/snp-sites

Running the tests

The tests can be run from the top-level directory:

autoreconf -i
./configure
make
make check

This requires libcheck (the check package in Ubuntu) to be installed.

Usage

Usage: snp-sites [-rmvpcbhV] [-o output_filename] <file>
This program finds snp sites from a multi fasta alignment file.
 -r     output internal pseudo reference sequence
 -m     output a multi fasta alignment file (default)
 -v     output a VCF file
 -p     output a phylip file
 -o STR specify an output filename [STDOUT]
 -c     only output columns containing exclusively ACGT
 -b     output monomorphic sites, used for BEAST
 -h     this help message
 -V     print version and exit
 <file> input alignment file which can optionally be gzipped

This application takes in a multi fasta alignment, finds all the SNP sites, then outputs the SNP sites in the following formats:

a multi fasta alignment,
VCF,
relaxed phylip format.

Example input

For the given input file:

>sample1
AGACACAGTCAC
>sample2
AGACAC----AC
>sample3
AAACGCATTCAN

the output is:

>sample1
GAG
>sample2
GA-
>sample3
AGT
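
The example above can be reproduced with a short sketch. This is an illustrative Python version, not the C implementation used by SNP-sites. It assumes that a column counts as a SNP site when it holds at least two distinct bases once gaps and N are ignored, which matches the example output but is an assumption about edge cases. The function name snp_columns is hypothetical, and the three records are keyed as sample1 to sample3 so they can be told apart.

```python
def snp_columns(seqs, ignore="-N?"):
    """Return the 0-based indices of columns with two or more distinct bases.

    Assumption: characters in `ignore` (gap, N, ?) never count as an allele.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must all have the same length")
    sites = []
    for i in range(length):
        alleles = {s[i].upper() for s in seqs} - set(ignore)
        if len(alleles) > 1:
            sites.append(i)
    return sites


alignment = {
    "sample1": "AGACACAGTCAC",
    "sample2": "AGACAC----AC",
    "sample3": "AAACGCATTCAN",
}
sites = snp_columns(list(alignment.values()))
# Keep only the SNP columns of every sequence, gaps included.
snp_alignment = {
    name: "".join(seq[i] for i in sites) for name, seq in alignment.items()
}
for name, seq in snp_alignment.items():
    print(f">{name}\n{seq}")
```

Columns 2, 5 and 8 (1-based) are the only ones with more than one base, so each output sequence is three characters long. The last column (C, C, N) is not reported because N is not treated as an allele in this sketch.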

Example usage

snp-sites my_alignment.aln
snp-sites my_gzipped_alignment.aln.gz

Output

Multi Fasta Alignment - Similar to the input file but just containing the SNP sites.
VCF - This contains the position of each SNP in the reference sequence, and the occurrence in each other sample. Can be loaded into Artemis for visualisation.
Relaxed Phylip format - All the SNP sites in a format for RAxML and other tree building applications.

License

SNP-sites is free software, licensed under GPLv3.

Feedback/Issues

This software is community supported. Please report any issues to the issues page.

Citation

If you use this software please cite:

"SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments", Andrew J. Page, Ben Taylor, Aidan J. Delaney, Jorge Soares, Torsten Seemann, Jacqueline A. Keane, Simon R. Harris, Microbial Genomics 2(4), (2016)

snp-sites's Issues

SegFault for unknown reason

I'm getting a somewhat mysterious segfault when running snp-sites (latest version in conda, installed as in instructions).

The reason it's mysterious is that the file which is causing the segfault is just a cat of two files which individually run through snp-sites without error. The genomes in the two files input to cat are the same length (both in 60 bp per line fasta format), and the output of cat looks fine by both seqkit stats and by visually inspecting where the two files have been joined.

Any thoughts?

Details below...

ubuntu@pennaeth:~/tm_data/phylo/results$ seqkit stats 2018.10.12/2018.10.12.all_tm.fasta
file                                format  type  num_seqs        sum_len     min_len     avg_len     max_len
2018.10.12/2018.10.12.all_tm.fasta  FASTA   DNA        191  5,470,978,215  28,643,865  28,643,865  28,643,865

ubuntu@pennaeth:~/tm_data/phylo/results$ seqkit stats 2018.10.15/2017.12.11.prelim_tm_data.reform.fa
file                                          format  type  num_seqs        sum_len     min_len     avg_len     max_len
2018.10.15/2017.12.11.prelim_tm_data.reform.fa  FASTA   DNA         35  1,002,535,275  28,643,865  28,643,865  28,643,865

ubuntu@pennaeth:~/tm_data/phylo/results$ cat 2018.10.12/2018.10.12.all_tm.fasta 2018.10.15/2017.12.11.prelim_tm_data.reform.fa > 2018.10.15/tmp.fa

ubuntu@pennaeth:~/tm_data/phylo/results$ seqkit stats 2018.10.15/tmp.fa
file               format  type  num_seqs        sum_len     min_len     avg_len     max_len
2018.10.15/tmp.fa  FASTA   DNA        226  6,473,513,490  28,643,865  28,643,865  28,643,865

ubuntu@pennaeth:~/tm_data/phylo/results$ snp-sites 2018.10.12/2018.10.12.all_tm.fasta
ubuntu@pennaeth:~/tm_data/phylo/results$ snp-sites 2018.10.15/2017.12.11.prelim_tm_data.reform.fa
ubuntu@pennaeth:~/tm_data/phylo/results$ snp-sites 2018.10.15/tmp.fa 
Segmentation fault (core dumped)
ubuntu@pennaeth:~/tm_data/phylo/results$ 

ubuntu@pennaeth:~/tm_data/phylo/results$ ls -lh *
-rw-rw-r-- 1 ubuntu ubuntu 2.7M Oct 15 07:39 2017.12.11.prelim_tm_data.reform.fa.snp_sites.aln
-rw-rw-r-- 1 ubuntu ubuntu  45M Oct 15 07:38 2018.10.12.all_tm.fasta.snp_sites.aln

2018.10.12:
total 5.3G
-rw-rw-r-- 1 ubuntu ubuntu 5.2G Oct 12 05:33 2018.10.12.all_tm.fasta

2018.10.15:
total 21G
-rw-rw-r-- 1 ubuntu ubuntu 957M Oct 15 06:37 2017.12.11.prelim_tm_data.fa
-rw-rw-r-- 1 ubuntu ubuntu 973M Oct 15 06:58 2017.12.11.prelim_tm_data.reform.fa
-rw-rw-r-- 1 ubuntu ubuntu 6.2G Oct 15 07:36 tmp.fa

Opens <file> twice?

I think the code opens the alignment file twice, once to load the first alignment (the 'reference') then re-opens it again, skips over the first sequence, then reads the rest.

Is there a way this could only open it once to allow piping of stdin to stdout so it can be used as a pipe filter?

Is the reference genome used when using the snp-sites software?

Dear teacher:
I found snp-sites very good, but I have a question. The input file I am using is coregene.aln, generated by Roary. Does snp-sites use a reference genome? Without a reference genome, how does it find SNPs?
Thanks in advance

No ./configure in release tarball despite docs

2.4.0 has no configure script.

I still had to run autoreconf -i -f.

First sequence in alignment is "special" ?

I was wondering if the first sequence in the alignment is considered "special" in some (undocumented) way?

I see the code does something unusual:

       // First sequence is the reference sequence so skip it
       // If there is an indel in the reference sequence, replace with the first proper base you find

line 3030: syntax error near unexpected token `CHECK,check' on RHEL 7.5

Fails to build from the github download with the following error:

./configure: line 3030: syntax error near unexpected token `CHECK,check'
./configure: line 3030: `PKG_CHECK_MODULES(CHECK,check >= 0.8.2,have_check="yes",'
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
./configure: line 3030: syntax error near unexpected token `CHECK,check'
./configure: line 3030: `PKG_CHECK_MODULES(CHECK,check >= 0.8.2,have_check="yes",'

Possibility to specify known reference

Hi, would it be possible to have an option to specify a reference, so that the REF column and the genotype calls in the VCF are relative to it?

Thanks a lot for this tool!

Have a nice day!
JC

Didn't mean to open this

Error with Mauve alignment input

Mauve formatted alignment input throws error about sequences of unequal length (by 1bp). Using conda installed version 2.4.1 with OSX. Is this a known issue?

segmentation fault in one fasta but not on the other

Hello, I am able to run snp-sites with one of my fasta files, but I get a segmentation fault when I try to run it with another one.

The output I get from gdb is

Core was generated by `snp-sites -mvp -o 8snps monoref_multi_8.fasta'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f6e8aa61827 in generate_snp_sites ()
    from /home/slh/.linuxbrew/cellar/snp-sites/2.2.0/lib/libsnp-sites.so.1

Any idea as to how to solve this?

Thanks!

make check => cannot find -lsubunit

% make check

<snip>
/usr/bin/ld: cannot find -lsubunit
collect2: error: ld returned 1 exit status
make[2]: *** [run-all-tests] Error 1

snp-sites does not see deletion events

Dear,

In order to validate snp-sites in our lab we have compared the results with msa2vcf.jar.

We took two closely related sequences from NCBI (ref NC_045512.2 and MT007544.1).

msa2vcf

$ java -jar dist/msa2vcf.jar --consensus 'NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome' --output ../sars_cov2.2.vcf ../sars_cov2.aln
[INFO][MsaToVcf]format : Fasta
$ cat ../sars_cov2.2.vcf
##fileformat=VCFv4.2
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##contig=<ID=chrUn,length=29903>
##msa2vcf.meta=compilation:20200728120720 githash:af51aa3 htsjdk:2.22.0 date:20200728122012 cmd:--consensus NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome --output ../sars_cov2.2.vcf ../sars_cov2.aln
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  MT007544.1 Severe acute respiratory syndrome coronavirus 2 isolate Australia/VIC01/2020, complete genome        NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
chrUn   19065   .       T       C       .       .       DP=2    GT:DP   1/1:1   0/0:1
chrUn   22303   .       T       G       .       .       DP=2    GT:DP   1/1:1   0/0:1
chrUn   26144   .       G       T       .       .       DP=2    GT:DP   1/1:1   0/0:1
chrUn   29749   .       ACGATCGAGTG     A       .       .       DP=2    GT:DP   1/1:1   0/0:1

snp-sites

$ snp-sites -c -v -o sars_cov2.vcf  sars_cov2.aln
$ cat sars_cov2.vcf
##fileformat=VCFv4.1
##contig=<ID=1,length=29903>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NC_045512.2     MT007544.1
1       19065   .       T       C       .       .       .       GT      0       1
1       22303   .       T       G       .       .       .       GT      0       1
1       26144   .       G       T       .       .       .       GT      0       1

Problem

snp-sites does not report the deletion.

FYI - updated brew package to 2.x series

FYI - https://github.com/Homebrew/homebrew-science/pull/2945

How to manage heterozygosity in SNP conversion?

Hello,

Sorry for this (I guess) basic question, but I did not find the answer in the README.md file nor in the paper (Page et al. 2016).

I try to convert FASTA alignments into a SNP-extracted VCF format for downstream analyses. Some alignments are for nuclear markers, and I work on a polyploid organism, so I sometimes have more than 2 haplotypes for a given individual, but all are properly phased.

My FASTA input is formatted as follows:

Individual1_a
Allele-a-sequence
Individual1_b
Allele-b-sequence
Individual2_a
Allele-a-sequence
Individual2_b
Allele-b-sequence
Individual2_c
Allele-c-sequence
...

I used a basic command:

snp-sites -v -o out.vcf in.fas

And I indeed got a .vcf file. But in this file, each allele seems coded as a homozygous individual, I see no 0/0/1 or even 0/1 in the output as expected, but rather only 0, 1 and 2 (like haploid calls).

How could I get an output so that phasing information and heterozygosity are considered? Is there an option in snp-sites that I missed? Or do I have to adapt my input, and how? (Like, losing the phasing information by merging the alleles, getting only 1 sequence per individual but with ambiguities?! Is that mandatory?)

Thank you for any answer.

Failing tests on various platforms

The test suite for 2.2.2 fails with a bunch of tests on various platforms, e.g. i386.
Here's the message:

FAIL: run-all-tests
=========================================
   snp-sites 2.2.2: src/test-suite.log
=========================================

# TOTAL: 1
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: run-all-tests
===================

Running suite(s): Creating_SNP_Sites
Alignment ../tests/data/uneven_alignment.aln contains sequences of unequal length. Expected length is 8 but got 9 in sequence Uneven_number_of_bases

76%: Checks: 21, Failures: 5, Errors: 0
../tests/check-snp-sites.c:40:F:snp_sites:valid_alignment_with_one_line_per_sequence:0: Invalid VCF file for 1 line per seq
../tests/check-snp-sites.c:79:F:snp_sites:valid_alignment_with_multiple_lines_per_sequence:0: Invalid VCF file for multiple lines per seq
../tests/check-snp-sites.c:67:F:snp_sites:valid_alignment_with_one_line_per_sequence_gzipped:0: Invalid VCF file for 1 line per seq
../tests/check-snp-sites.c:53:F:snp_sites:valid_alignment_with_n_as_gap:0: Invalid VCF file for 1 line per seq
../tests/check-snp-sites.c:132:F:snp_sites:valid_with_all_outputted_with_custom_name:0: Custom name needs extra extension for VCF
Running suite(s): Creating_VCF_file
100%: Checks: 3, Failures: 0, Errors: 0
FAIL run-all-tests (exit status: 1)

============================================================================
Testsuite summary for snp-sites 2.2.2
============================================================================
# TOTAL: 1
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See src/test-suite.log
============================================================================

A full log is at https://buildd.debian.org/status/fetch.php?pkg=snp-sites&arch=i386&ver=2.2.2-1&stamp=1458222828 for example.

I can reproduce this on a Jessie i386 Vagrant box. A quick bisect flagged commit b2efeb4288d4201480408b7dfe3e314c243e53c6 as the first bad one, but I'm not familiar enough with what is being tested to say more.

Linuxbrew installation issue (links issue)

[rbutler@genomics 85]$ brew install snp-sites
==> Installing snp-sites from homebrew/science
==> Installing dependencies for homebrew/science/snp-sites: patchelf
==> Installing homebrew/science/snp-sites dependency: patchelf
==> Downloading https://linuxbrew.bintray.com/bottles/patchelf-0.9_1.x86_64_linux.bottle.tar.gz

################################################################## 100.0%

==> Pouring patchelf-0.9_1.x86_64_linux.bottle.tar.gz
/home/rbutler/.linuxbrew/Cellar/patchelf/0.9_1: 6 files, 1.2M
==> Installing homebrew/science/snp-sites
==> Downloading https://linuxbrew.bintray.com/bottles-science/snp-sites-2.2.0.x86_64_linux.bottle.tar.gz

################################################################## 100.0%

==> Pouring snp-sites-2.2.0.x86_64_linux.bottle.tar.gz
Error: An unexpected error occurred during the brew link step
The formula built, but is not symlinked into /home/rbutler/.linuxbrew
No such file or directory @ realpath_rec - /home/linuxbrew
Error: No such file or directory @ realpath_rec - /home/linuxbrew
[rbutler@genomics 85]$ brew update
Already up-to-date.
[rbutler@genomics 85]$ brew tap homebrew/science
[rbutler@genomics 85]$ brew install snp-sites
Warning: homebrew/science/snp-sites-2.2.0 already installed, it's just not linked

snps-sites tool help

I am working on Klebsiella pneumoniae phylogenetic analysis based on the snps sites of the core genome (snp-sites tool).
I would appreciate very much if you could help to localize the correct output files which show the following information:

- the distribution and location of SNPs for each isolate along the core genome 
- the ratio of nonsynonymous-to-synonymous SNPs

Thank you very much for your help

32 bit signed integer error

Hi,
I noticed a similar issue to a previously closed one (#80), which I'm experiencing with the most recent version (2.5.1) of snp-sites.
It appears that sequences longer than 2,147,483,647 bases give the error "Warning: No SNPs were detected so there is nothing to output." 2,147,483,647 is the maximum value for a 32 bit signed integer.
I've spent a bit of time looking into this and here's what I've done to prove this.

I took two sequences from an alignment, one of which was the outgroup, so as to maximise the number of snps.
Each sequence was 2,423,158,460 bases in length:

$ cat sample1.fasta Outgroup.fasta > test.fasta

$ snp-sites -V
snp-sites 2.5.1

$ snp-sites -c -o test_snps.fasta test.fasta
Warning: No SNPs were detected so there is nothing to output.

I then cut the length of the sequences down to 2,147,483,648, one base longer than the 32 bit signed integer maximum value:

$ cut -c 1-2147483648 test.fasta > test1.fasta
$ snp-sites -c -o test1_snps.fasta test1.fasta
Warning: No SNPs were detected so there is nothing to output.

I then cut the length of the sequence down to 2,147,483,647, the 32 bit signed integer maximum value:

$ cut -c 1-2147483647 test.fasta > test2.fasta
$ snp-sites -c -o test2_snps.fasta test2.fasta
/opt/slurm/data/slurmd/job28028674/slurm_script: line 13: 38321 Segmentation fault      snp-sites -c -o test1_snps.fasta test1.fasta

I then cut the length of the sequence down to 2,147,483,646, one base less than the 32 bit signed integer maximum value:

$ cut -c 1-2147483646 test.fasta > test3.fasta
$ snp-sites -c -o test3_snps.fasta test3.fasta

This time snp-sites ran successfully and identified 28,880,245 variant sites.

So it seems that a sequence length exactly at the 32 bit signed integer maximum value causes a segmentation fault, while lengths over that limit cause snp-sites to report that there are no SNPs.

Graham
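
The threshold reported above is what one would expect if a sequence length or column index were held in a 32-bit signed integer somewhere. Whether snp-sites actually does this is an assumption, since the issue does not identify the variable involved. The sketch below only shows how the lengths from the report wrap around when forced into 32 bits. The helper name as_int32 is hypothetical.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647


def as_int32(n):
    """Reinterpret n as a C 32-bit signed int (ctypes does no overflow check)."""
    return ctypes.c_int32(n).value


for length in (2_147_483_646, 2_147_483_647, 2_147_483_648, 2_423_158_460):
    print(f"{length:>13,} -> {as_int32(length):>14,}")
```

Lengths up to INT32_MAX survive unchanged, while 2,147,483,648 becomes -2,147,483,648 and the full 2,423,158,460-base sequence becomes -1,871,808,836. A loop bounded by a negative length would never run, which would be consistent with the "No SNPs were detected" warning, although that link is a guess rather than something confirmed in the source code.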

Add a --version switch?

For pipeline auditing (--version, or -v or -V if you can only use short options)

% snp-sites --version
snp-sites 2.0.2

autoreconf -i warnings => option 'subdir-objects' is disabled

src/Makefile.am:26: warning: source file '../tests/check-snp-sites.c' is in a subdirectory,
src/Makefile.am:26: but option 'subdir-objects' is disabled
automake: warning: possible forward-incompatibility.
automake: At least a source file is in a subdirectory, but the 'subdir-objects'
automake: automake option hasn't been enabled.  For now, the corresponding output
automake: object file(s) will be placed in the top-level directory.  However,
automake: this behaviour will change in future Automake versions: they will
automake: unconditionally cause object files to be placed in the same subdirectory
automake: of the corresponding sources.
automake: You are advised to start using 'subdir-objects' option throughout your
automake: project, to avoid future incompatibilities.
src/Makefile.am:26: warning: source file '../tests/check-vcf.c' is in a subdirectory,
src/Makefile.am:26: but option 'subdir-objects' is disabled
src/Makefile.am:26: warning: source file '../tests/helper-methods.c' is in a subdirectory,
src/Makefile.am:26: but option 'subdir-objects' is disabled
src/Makefile.am:26: warning: source file '../tests/run-all-tests.c' is in a subdirectory,
src/Makefile.am:26: but option 'subdir-objects' is disabled

Missing data in vcf output format

Hello,

I was using your tool to obtain the SNPs in a fasta alignment and using vcf as an output format. I noticed that in cases where my reference is some nucleotide (A, C, G or T), samples that have missing data (N), will have a 0 - becoming REF genotype - and won't be coded as missing data anymore.

For example:

#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | BID_1 | CHS_2 | LAM_1 | LAM_2 | LAM_3 |
1 | 3032469 | . | C | T | . | . | . | GT | 0 | 0 | 0 | 0 | 1 |

When this position in the alignment is:

BID_1
C
CHS_2
C
LAM_1
N
LAM_2
N
LAM_3
T

I was wondering if this is the normal behavior? Or should I code missing data in another format (? or -), so that missing data will be properly noticed by the tool?

Thanks,
Mafalda
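
For comparison, the encoding the poster expected can be sketched as follows. This is not what snp-sites does (the issue reports that it writes 0 for N). It is a hypothetical helper, genotype_calls, that writes the VCF missing value "." for any sample whose base is N, a gap or ?, on the assumption that those three characters all mean missing data.

```python
MISSING = set("N-?")  # assumption: these characters all denote missing data


def genotype_calls(ref, column):
    """Encode one alignment column as haploid VCF genotype strings.

    Returns (alt_alleles, calls). Missing characters become ".",
    the reference base becomes "0", and each new base gets the next index.
    """
    ref = ref.upper()
    alts = []
    calls = []
    for base in column:
        base = base.upper()
        if base in MISSING:
            calls.append(".")
        elif base == ref:
            calls.append("0")
        else:
            if base not in alts:
                alts.append(base)
            calls.append(str(alts.index(base) + 1))
    return alts, calls


# The column from the report: BID_1, CHS_2, LAM_1, LAM_2, LAM_3
alts, calls = genotype_calls("C", ["C", "C", "N", "N", "T"])
print(",".join(alts), " ".join(calls))
```

With this encoding LAM_1 and LAM_2 are written as "." rather than 0, so downstream tools can tell missing data apart from a reference call.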

Keeping indels, projecting reference coordinates?

Hi there,
Is there any planned (or implemented) way to analyze indels via snp-sites? Basically, I want to annotate all variants (not just snps) in my multiple-sequence alignment and project those variants against the original reference (which also has indels) coordinate system.

Thanks,
John

Warning: No SNPs were detected so there is nothing to output

I've got a set of 972 Mtb isolates that I'm trying to run snp-sites -c -o on, but it fails with the error Warning: No SNPs were detected so there is nothing to output. However, it works with removing the -c flag. How can I try and figure out which isolates are causing problems?
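
One way to look for problem isolates is to count, per sequence, how many characters fall outside A, C, G and T, since -c keeps only columns made up exclusively of those four bases. The sketch below is a hypothetical helper script, not part of snp-sites, and the function names read_fasta and non_acgt_counts are illustrative.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def non_acgt_counts(records):
    """Map each sequence name to its number of characters outside ACGT."""
    counts = {}
    for name, seq in records:
        tally = Counter(seq.upper())
        counts[name] = len(seq) - sum(tally[b] for b in "ACGT")
    return counts


example = """>iso1
ACGTACGT
>iso2
ACNTAC-T
>iso3
NNNNACGT
""".splitlines()

counts = non_acgt_counts(read_fasta(example))
# Worst offenders first.
for name, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(name, n)
```

For a real alignment, pass an open file handle to read_fasta instead of the example lines. Isolates with a very high count are the most likely to be knocking out every variable column when -c is used.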

Is there a way to extract SNP mutations from the VCF file?

Hello,

I have run snp-sites and I got the three files.
But I want to have the SNP mutations from the VCF file.
Is there a way to do this with snp-sites?

Thank you.

Add Arxiv paper link to README.md

snp-sites docker install issue

Hi,

I have downloaded Docker on my mac (Mojave OS10.14.6), trying to pull the snp-sites container using the command found here: https://quay.io/repository/biocontainers/snp-sites

I unfortunately get the following error:

~$docker pull quay.io/biocontainers/snp-sites
Using default tag: latest
Error response from daemon: manifest for quay.io/biocontainers/snp-sites:latest not found: manifest unknown: manifest unknown

Am I doing something wrong here?

Thanks!

Docker file typo

#
# Install Roary
#
RUN apt-get install snp-sites

:-P

apt-get installation of snp-sites

Hello, how do I install the latest version using apt-get on Ubuntu Linux? All I can get is v1.5.0.

Thanks!

Gaps not reported in output?

Hi,

thanks for this very fast and elegant tool. I was wondering whether there is any option to also consider gaps. I am using this alignment, which has a region with gaps in a single sequence. I ran the following command:

snp-sites -o out.fasta aln.trimmed.fasta
head out.fasta

>genome|b0463
GCGCAA
>NT12002_188|b0463
GCGCAA
>NT12003_214|b0463
GCGCAA
>NT12004_22|b0463
GCGCAA
>NT12005_17|b0463
GTGCAA

The output is only including the snps and not the gaps. I imagine this is due to the fact that the region with gaps doesn't have any SNPs?

Is this an expected behaviour?
Best,
Marco

Segmentation error

I am using snp-sites version 2.4.1. It is used in the snippy pipeline to detect SNPs in a full alignment. The error message from this pipeline is:

snp-sites: symbol lookup error: snp-sites: undefined symbol: generate_snp_sites_with_ref_pure_mono
ERROR: Could not run: snp-sites -c -o 2_strains.aln 2_strains.full.aln

I have tried to launch again with only two strains and got a segmentation error

snp-sites 2_strains.full.aln

Any ideas?

2_strains.full.aln.zip

Output invariant sites and nucleotide frequencies

In general, phylogenetic programs use invariant sites for likelihood calculations. However, a number of programs, such as RAxML and BEAST, can perform ascertainment bias corrections given the number of invariant sites and the frequencies of nucleotides in the alignment. If SNP-sites output these values, they could be used as direct inputs for RAxML, for example.

possible to convert to redhat?

Maybe this is a dumb question, but I'm having a hard time installing this on my school's linux box. They are running redhat, and I am not sure if it is exactly compatible. The configure command for starters doesn't seem to work. I was looking into converting the package but it looks like it requires another program that requires root access, which I do not have.

Is there a sequence length limit? "Warning: No SNPs were detected so there is nothing to output."

Hello,
I have an alignment file with full consensus genome sequences of 6 samples in exactly the same frames with the same number of bases without any blanks or indels but only with SNPs. And when I run snp-sites with default settings I get the error message:
"Warning: No SNPs were detected so there is nothing to output."

When I get the first 1000 bases for each sample without changing anything else in my file (for instance by doing cut -c1-1000) the program works and finds the SNPs; just stating so that it is clear that my installation and file formats work fine.

Samples have about 2.3 billion bases and I am working on an HPCC with over 500GB ram available and I don't get any other error related to memory. If you know the sequence length limit, could you please let me know so that I subset my file to the maximum length?

Thanks!

Print to screen

Hi there,
Just a suggestion - an output file seems to be required to run commands, but it might be nice if we could just print to screen if we just want to have a quick look at the data.
Thanks!

how to visualize VCF file in artemis?

Hello, I am new to the bioinformatics field. According to the snp-sites documentation, the output VCF file can be visualized in Artemis?

Is any processing necessary for that? I have a VCF file and I am trying to load it in Artemis, but nothing happens.

I have processed it using bgzip and tabix. Artemis fails to open the vcf.gz.tbi file because it does not recognize the binary file format.

Could you please suggest how I can do this?

Problem with installation via Bioconda

Hi!
I just tried to install snp-sites using Bioconda and get the following message:

UnavailableInvalidChannel: The channel is not accessible or is invalid.
channel name: snp-sites
channel url: https://conda.anaconda.org/snp-sites
error code: 404

You will need to adjust your conda configuration to proceed.
Use conda config --show channels to view your configuration's current state,
and use conda config --show-sources to view config file locations.

I copied the url into my browser but it's not available there.

Best,
Martinique

error: alignment contains sequences of unequal length

Hi,

I have generated an alignment between 2 genomes using progressiveMauve (default parameters) and I'm now trying to extract SNPs using snp-sites.

My issue is that I get the error message: '' Alignment my_ali contains sequences of unequal length. Expected length is 42875 but got 42876 in sequence ''

However both sequences have a length of 42876, but both sequences have 1 indel '-'.

any idea about how to fix that ?

thanks
Romain

snp-sites -c doesn't produce recognisable alignment for IQtree

Hello,

I am trying to generate a core genome tree for a local outbreak of a bacterial plant pathogen (Ralstonia solanacearum) by running IQtree on the output of snp-sites -c. I have tried generating an alignment with snippy-core with a reference strain, and with a de novo assembly alignment done with mafft, but in both cases IQtree crashes because it does not recognise the input file as an alignment. I suspect that it has something to do with the output from snp-sites -c being just a multifasta and therefore unrecognisable to IQtree, but my understanding was that this functionality of iqtree is specifically for snp-sites. I have now tried it with another data set and I get the same result.

I have attached the snp alignment from snippy fed to iqtree after snp-sites -c, also fconst output and iqtree error log files.
The commands I used to generate the files are:
$ snp-sites -C core.full.aln > fconst_output.txt
$ snp-sites -c core.full.aln > snp-sites.aln
$ iqtree -fconst fconst_output.txt -s snp-sites.txt

With output from snippy:
fconst_output.txt
iqtree_error.log
snp-sites.txt

Let me know if more information is needed!
Thank you!

specify reference sequence

How can someone specify that a given sequence is the reference and comparisons should be relative to the given sequence? Does the tool assume that the first sequence in the alignment is the reference sequence?

How do we run the tests?

There is a tests folder but it's not clear how to trigger it?

Pipe into snp-sites

From RT:
Hi there.

Thanks for the snpsites program, very helpful.
I haven't been able to pipe (|) into snp-sites yet, how can this be done? Otherwise I would strongly recommend it.

The example on https://github.com/sanger-pathogens/snp-sites makes no sense to me, which command was used here?

Thanks.

brew recipe needs to be updated to the latest release.

Current linuxbrew recipe only installs version 2.2.0.

Build is failing on Mac OS

Required user to install automake, autoconf, libtool, and check.

Fails to build from the github download with the following error:

./configure: line 3030: syntax error near unexpected token `CHECK,check'
./configure: line 3030: `PKG_CHECK_MODULES(CHECK,check >= 0.8.2,have_check="yes",'

Option to disallow gap and/or N and/or non-AGTC ?

We are impressed by the speed of this tool (due to being C code).

A very useful feature we need is the ability to also filter out things like:

gap -
N
non-AGTC eg * and X etc

These would need to be independent options.

Ideally the current default behaviour to remove conserved (monomorphic) sites could also be an option, e.g. so we could remove all columns with a gap only and leave the rest.

Recommended workflow?

What is the recommended workflow to produce a VCF file when starting with two unaligned bacterial whole-genome fasta files, each approximately 3 million nucleotides? Any particular aligner and file format conversion utilities you can recommend?

specify order of output of -C option in help message

Thanks for the new -C function for counting constant sites...super useful! Can I recommend explicitly stating that the output of -C is in A,C,G,T order in the help message? Some users might not know that alphabetical order is convention.
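
To make the A,C,G,T ordering concrete, here is a small sketch of counting constant sites per base and printing them as a comma-separated list. It is an illustration only. How snp-sites -C treats columns containing N or gaps is not stated here, so this version assumes that only columns where every sequence holds the same A, C, G or T are counted. The function name constant_site_counts is hypothetical.

```python
def constant_site_counts(seqs):
    """Count columns where every sequence has the same base, per base.

    Assumption: columns containing anything other than a single shared
    A, C, G or T are not counted.
    """
    counts = dict.fromkeys("ACGT", 0)
    for column in zip(*seqs):
        bases = {b.upper() for b in column}
        if len(bases) == 1:
            (base,) = bases
            if base in counts:
                counts[base] += 1
    return counts


seqs = ["AACGTT", "AACGTA", "AACGTT"]
counts = constant_site_counts(seqs)
# Report in A,C,G,T order.
fconst = ",".join(str(counts[b]) for b in "ACGT")
print(fconst)
```

Here the first five columns are constant (two A, one C, one G, one T) and the last column is variable, so the printed string is 2,1,1,1.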

functionality of -b option alone?

I am trying to understand what the -b option does when it is not paired with the -c option. I am working with a 4,640,668 bp long alignment.

When I run snp-sites on the alignment without either the -b or -c options, I get a resulting alignment of 1,733 sites. My understanding is that these are all of the variant sites in the full alignment, regardless of whether or not there is missing data (N, -, or ?) in some samples.
When I run snp-sites -c, I get 944 variant sites (ACGT-only sites), which implies there are 789 variant sites with missing data in at least one sample (1,733 - 944 = 789).
When I run snp-sites -cb, I get an alignment of 2,903,621 bp. My understanding is that this is the ACGT-only sites plus the monomorphic sites (944 + 2,902,677 = 2,903,621).
Based on the above logic, I assumed that running snp-sites -b would give me all of the 1,733 variant sites (both ACGT-only and those with missing data) plus the monomorphic sites (1,733 + 2,902,677 = 2,904,410).
However, when I do run snp-sites -b I get the complete 4,640,668 bp alignment.

Am I missing something about what the -b option is doing? Any help would be much appreciated. Thank you!
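
The arithmetic above leaves 4,640,668 - 2,904,410 = 1,736,258 columns unaccounted for. Note that the 2,902,677 monomorphic figure was derived from the -cb run, so it covers only monomorphic columns with no missing data. One possible explanation, offered as an assumption rather than confirmed behaviour, is that the remaining columns are monomorphic (or entirely missing) but contain N, - or ? in at least one sample. The sketch below classifies columns into those categories so the numbers can be checked on a real alignment. The names classify_column and tally are hypothetical.

```python
from collections import Counter

MISSING = set("N-?")  # assumption: these characters denote missing data


def classify_column(column):
    """Return (kind, has_missing) for one alignment column.

    kind is "variant", "monomorphic" or "all_missing", judged on the
    bases that remain after missing characters are set aside.
    """
    bases = {b.upper() for b in column}
    has_missing = bool(bases & MISSING)
    alleles = bases - MISSING
    if len(alleles) > 1:
        kind = "variant"
    elif len(alleles) == 1:
        kind = "monomorphic"
    else:
        kind = "all_missing"
    return kind, has_missing


def tally(seqs):
    """Count columns in each (kind, has_missing) category."""
    return Counter(classify_column(col) for col in zip(*seqs))


seqs = ["ACGTN-A", "ACGAN-N", "ACNTN-C"]
result = tally(seqs)
for key, n in sorted(result.items()):
    print(key, n)
```

Under this reading, -c would keep the ("variant", False) columns, -cb would add ("monomorphic", False), and the columns flagged True would only appear when -c is dropped.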

segmentation fault with `-b` option

Version:

$ snp-sites -V
snp-sites 2.3.2

No problems with standard usage e.g. outputting variant sites only.

On Linux:

$ snp-sites -b -o monomorphic.fa full.aln 
Segmentation fault (core dumped)

On MacOS:

$ snp-sites -b -o monomorphic.fa full.aln
Segmentation fault: 11

Is there a way to turn on a verbose or debugging mode?
I can supply the alignment file (multifasta) if needed. 13 sequences, each 1.89Mbp in length.

Thanks for your help.

Info in VCF output format

Is there a way with snp-sites to get data on to the info column on a VCF output?

Segmentation fault with larger sequences (on Desktop and HPC)

Hi,

I am running into a segmentation fault when calling the variants on a fasta with genome size > approx 9Mb (8Mb of the same fasta works fine).
Any thoughts as to why this is the case? I couldn't see a hard maximum genome length in the code.

edit: generating some test fastas with the included perl script reveals that it is an issue with the number of variants in the alignment, which makes sense.

Cheers

Chris
