marbl / parsnp

Parsnp was designed to align the core genome of hundreds to thousands of bacterial genomes within a few minutes to a few hours. Input can be both draft assemblies and finished genomes, and output includes variant (SNP) calls, core genome phylogeny and multi-alignments. Parsnp leverages contextual information provided by multi-alignments surrounding SNP sites for filtration/cleaning, in addition to existing tools for recombination detection/filtration and phylogenetic reconstruction.

License: Other

Python 7.58% Shell 0.02% Makefile 0.30% C++ 87.68% C 2.36% M4 2.05%

parsnp's Issues

ERROR: ref genome sequence * seems to aligned! remove and restart

Hello there,

I'm hoping to use parsnp to find SNPs in a small set of SAGs (Single Amplified Genomes). I've got six sets of contigs sequenced and assembled from single bacterial cells in the folder NBK19/. Since there are no references for this species, I tried the autorecruiting option first, and then the FASTA reference option, selecting a random genome from the folder. My commands looked like the following (executed within the directory):

$ parsnp -q AG-470-J09_contigs.fasta -d . -p 4
$ parsnp -r ! -d . -p 4

In both cases, parsnp reads the available FASTA files and then dies:

-->Reading Genbank file(s) for reference (.gbk) ..
|->[WARNING]: no genbank file provided for reference annotations, skipping..
ERROR: ref genome sequence AG-470-J09_contigs.fasta seems to aligned! remove and restart

The actual reference sequence varies when I'm running the random reference FASTA option, but the gist of the error is always "seems to aligned!" Any idea what this is about?

In case it's relevant, I'm running this on OSX.

Thanks,
/Ingo Ohlsson

Compiled Parsnp-OSX64-v1.2.tar, hangs and errors, OS X 10.12 issue?

After updating my Mac to OS X 10.12.6, I downloaded the binary for parsnp and tried to run previously working commands. The first issue is that the help page takes over a minute to display:

sh-3.2$ time ./Parsnp-OSX64-v1.2/parsnp                                                                           
|--Parsnp v1.2--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
usage: parsnp [options] [-g|-r|-q](see below) -d <genome_dir> -p <threads>

Parsnp quick start for three example scenarios: 
1) With reference & genbank file: 
...
 -P = <int>: max partition size? limits memory usage (default= 15000000)
 -v = <flag>: (v)erbose output? (default = NO)
 -V = <flag>: output (V)ersion and exit


real    1m14.101s
user    1m10.017s
sys     0m0.621s

Second, previously working commands now fail like this:

sh-3.2$ time ./Parsnp-OSX64-v1.2/parsnp -c -x -p 4 -r Mvel_NIH1002.PB_DATA.canu.SK.fasta -d ./genomes/            
|--Parsnp v1.2--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
***********************************************************************************
SETTINGS:
|-refgenome:    Mvel_NIH1002.PB_DATA.canu.SK.fasta
|-aligner:      libMUSCLE
|-seqdir:       ./genomes/
|-outdir:       /Users/conlans/Desktop/Mucor_parsnp_redo/P_2018_05_07_140352716154
|-OS:           Darwin
|-threads:      4
***********************************************************************************

<<Parsnp started>>

-->Reading Genome (asm, fasta) files from ./genomes/..
  |->[OK]
-->Reading Genbank file(s) for reference (.gbk) ..
  |->[WARNING]: no genbank file provided for reference annotations, skipping..
-->Running Parsnp multi-MUM search and libMUSCLE aligner..
**ERROR**
The following command failed:
>>/var/folders/h5/kzbqz8hs5dz1q1234dltgkxh0pbxfz/T/_MEICrRy19/parsnp /Users/conlans/Desktop/Mucor_parsnp_redo/P_2018_05_07_140352716154/parsnpAligner.ini
Please veryify input data and restart Parsnp. If the problem persists please contact the Parsnp development team.
**ERROR**


real    1m20.935s
user    1m14.233s
sys     0m1.290s

Parsnp -d sometimes fails to recruit random files

Todd,

If I create a folder called 'fasta' with 20 small identical fasta files and run "parsnp -r '!' -d fasta", my resulting tree often has only 19 genomes in it, and other times 20. The 'missing' genome is somewhat random, and is missing from the RECRUITED GENOMES list. Running the command over and over again gives different results.

This bug has us confused. I'm thinking it might be a non-deterministic race condition, even though I'm using the default -p 1.

Torsten
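
One way to pin down which genome was dropped on a given run is to compare the input file names with the leaf labels in parsnp.tree. The following is a minimal sketch, not part of parsnp: `leaf_labels` and `missing_genomes` are hypothetical helpers, and the code assumes that leaves are written as single-quoted file names and that the reference label may carry an extra suffix such as ".srt" (adjust `ref_suffixes` to whatever your tree actually shows).

```python
import re

def leaf_labels(newick):
    """Return the set of single-quoted leaf labels found in a Newick string."""
    return set(re.findall(r"'([^']+)'", newick))

def missing_genomes(input_names, newick, ref_suffixes=(".srt", ".ref")):
    """List input file names that do not appear as leaves in the tree.

    ref_suffixes is an assumption: suffixes that may be appended to the
    reference label and should be ignored when comparing names.
    """
    labels = set()
    for label in leaf_labels(newick):
        for suffix in ref_suffixes:
            if label.endswith(suffix):
                label = label[: -len(suffix)]
                break
        labels.add(label)
    return sorted(set(input_names) - labels)

# Toy example: three input files, but only two made it into the tree.
tree = "('2.fa.srt':0.0,'1.fa':0.0);"
print(missing_genomes(["1.fa", "2.fa", "3.fa"], tree))
```

In practice the input names would come from `os.listdir` on the genome directory and the Newick string from reading parsnp.tree. Running this after each repeat makes it easy to see whether the dropped genome really changes between runs.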

Work with 1000 genomes

Dear parsnp team,

I have a directory with 1320 genome assemblies. I ran the command:

parsnp -g ../../DB_all/genome_reference_VP/CP007004.gbk,../../DB_all/genome_reference_VP/CP007005.gbk -d ../../Data/all_v_p_for_parsnp/ -p 6

It runs without errors, but only 73 genomes are aligned.

######################

For detailed documentation please see --> http://harvest.readthedocs.org/en/latest

-->Reading Genome (asm, fasta) files from ../../Data/all_v_p_for_parsnp/..
|->[OK]
-->Reading Genbank file(s) for reference (.gbk) ../../DB_all/genome_reference_VP/CP007004.gbk,../../DB_all/genome_reference_VP/CP007005.gbk..
|->[OK]
-->Calculating MUMi..
|->[OK]
-->Running Parsnp multi-MUM search and libMUSCLE aligner..

|->[OK]
-->Running PhiPack on LCBs to detect recombination..
|->[SKIP]
-->Reconstructing core genome phylogeny..
|->[OK]
-->Creating Gingr input file..
|->[OK]
-->Calculating wall clock time..
|->Aligned 73 genomes in 1.00 hours

<<Parsnp finished! All output available in /home/ubuntu/Results/all_1327_parsnp/P_2018_06_11_185050006955>>

Validating output directory contents...
1)parsnp.tree: newick format tree [OK]
2)parsnp.ggr: harvest input file for gingr (GUI) [OK]
3)parsnp.xmfa: XMFA formatted multi-alignment [OK]

parsnp v1.1: how to convert XMFA to phylip/nexus/fasta?

Hi.

I really like this program for exploring the core genome and identifying SNVs. But sometimes I want to export the alignment (XMFA) to another format so I can use another program with an ML method.
How can I do this? I tried a Perl script, but the strain names disappear and only the index is shown (1, 2, 3, 4, etc.).

Thanks.
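
The lost strain names can usually be recovered from the XMFA header rather than from the alignment entries. Below is a minimal sketch, not an official converter: it assumes the header contains `##SequenceIndex N` lines each followed by a `##SequenceFile name` line, that entry headers start with `>N:`, and that blocks end with a line starting with `=`. `xmfa_to_fasta` is a hypothetical helper; it concatenates all blocks into one aligned FASTA and pads genomes absent from a block with gaps.

```python
import re
from collections import defaultdict

def xmfa_to_fasta(xmfa_text):
    """Concatenate XMFA blocks into one aligned FASTA string with genome names."""
    names = {}                  # sequence index -> file name from the header
    concat = defaultdict(list)  # sequence index -> aligned pieces, one per block
    block = {}                  # sequence index -> lines of the block being read
    current = None
    pending_index = None

    def flush():
        # Append the finished block; pad genomes missing from it with gaps.
        if not block:
            return
        width = max(len("".join(lines)) for lines in block.values())
        for idx in names:
            seq = "".join(block.get(idx, []))
            concat[idx].append(seq.ljust(width, "-"))
        block.clear()

    for line in xmfa_text.splitlines():
        line = line.strip()
        if line.startswith("##SequenceIndex"):
            pending_index = int(line.split()[1])
        elif line.startswith("##SequenceFile") and pending_index is not None:
            names[pending_index] = line.split(None, 1)[1]
        elif line.startswith("#") or not line:
            continue
        elif line.startswith(">"):
            current = int(re.match(r">(\d+):", line).group(1))
            if current not in names:
                raise ValueError(f"sequence index {current} not declared in header")
            block[current] = []
        elif line.startswith("="):
            flush()
            current = None
        elif current is not None:
            block[current].append(line)
    flush()
    return "".join(f">{names[i]}\n{''.join(concat[i])}\n" for i in sorted(names))

# Toy input in the assumed layout (made-up names and coordinates).
example = """#SequenceCount 2
##SequenceIndex 1
##SequenceFile strainA.fna
##SequenceIndex 2
##SequenceFile strainB.fna
>1:1-4 + cluster1
ACGT
>2:1-3 + cluster1
AC-T
=
>1:10-12 + cluster2
GGA
>2:20-22 + cluster2
GCA
=
"""
print(xmfa_to_fasta(example), end="")
```

The resulting FASTA alignment can then be converted to PHYLIP or NEXUS with a general-purpose alignment converter of your choice.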

installation problem

Dear Developers,

I ran into a problem at the last step of INSTALL while installing the software:

/usr/bin/ld: cannot find -lMUSCLE-3.7
collect2: error: ld returned 1 exit status
Makefile:345: recipe for target 'parsnp' failed
make[1]: *** [parsnp] Error 1
make[1]: Leaving directory '/home/thkuo/bin/parsnp/src'
Makefile:329: recipe for target 'install-recursive' failed
make: *** [install-recursive] Error 1

Could you help me to solve it?

Parsnp.py input for MUMi

Parsnp.py
Line 491: the usage text (line 262) states that the MUMi cutoff should be set with "-i" and typed as <float>, but here "-i" is assigned to inifile.
Using "-i" generates the error "No ini file with this name".
Please fix the usage text to show that "-U" is required for the MUMi distance, and update the Read the Docs documentation.

Not incorporating all files

I'm running parsnp on a directory containing 575 strains, and only 573 of them are being incorporated into the analysis. I've used the -c argument to dictate that my directory is curated. The command I used was:

parsnp -c -v -r /Scratch.NFS/.../EGDe.fasta -d /Scratch.NFS/.../Cat/ -p 24 -o ParsnpTest

Any thoughts?

-c option MUMi

Hi,

When I use the -M option to report MUM indexes, not all of the genomes are included. I tried the -c option to force inclusion of all genomes in the directory, but then I get:

Traceback (most recent call last):
  File "<string>", line 876, in <module>
AttributeError: 'str' object has no attribute 'close'

Is there another way of getting MUMi for all genomes? When running parsnp without -M option a file called all.mumi is generated but this doesn't include file names. recruitedgenomes.lst contains file names and MUMi, but only for those that were recruited (which is not the entire dataset).

Order of genomes in ini file matters?

It seems like the order in which genomes are added to the parsnp ini file:

- is not consistent from platform to platform. I get different orders from the OS X and Linux compiled versions.
- seems to matter. For otherwise identical ini files, the output trees and %core differ based on the order of the genomes in the ini file.
- doesn't seem to be controllable by the user.

Is there a way to control the order that the genomes are added? Alternatively, is there a way to feed an ini file directly to parsnp so I can experiment with this?

Different output from source 1.0 and binary 1.0.1

Please tag a source release that corresponds to the source that was used to compile the 1.0.1 binary.

The source 1.0 reports

❯❯❯ parsnp
ERROR: No ini file specified!

The binary 1.0.1 reports

❯❯❯ parsnp_OSX64_v1.0.1/parsnp
|--Parsnp v1.0--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
usage: parsnp [options] [-g|-r|-q](see below) -d <genome_dir> -p <threads>

Parsnp quick start for three example scenarios: 
1) With reference & genbank file: 
 >parsnp -g <reference_genbank_file1,reference_genbank_file2,..> -d <genome_dir> -p <threads> 

2) With reference but without genbank file:
 >parsnp -r <reference_genome> -d <genome_dir> -p <threads> 

3) Autorecruit reference to a draft assembly:
 >parsnp -q <draft_assembly> -d <genome_db> -p <threads> 

[Input parameters]
<<input/output>>
 -c = <flag>: (c)urated genome directory, use all genomes in dir and ignore MUMi? (default = NO)
 -d = <path>: (d)irectory containing genomes/contigs/scaffolds
 -r = <path>: (r)eference genome (set to ! to pick random one from genome dir)
 -g = <string>: Gen(b)ank file(s) (gbk), comma separated list (default = None)
 -o = <string>: output directory? default [./P_CURRDATE_CURRTIME]
 -q = <path>: (optional) specify (assembled) query genome to use, in addition to genomes found in genome dir (default = NONE)

<<MUMi>>
 -U = <float>: max (M)UMi distance (default: autocutoff based on distribution of MUMi values)
 -M = <flag>: calculate MUMi and exit? overrides all other choices! (default: NO)

<<MUM search>>
 -a = <int>: min (a)NCHOR length (default = 1.1*Log(S))
 -C = <int>: maximal cluster D value? (default=100)
 -z = <path>: min LCB si(z)e? (default = 25)

<<LCB alignment>>
 -D = <float>: maximal diagonal difference? Either percentage (e.g. 0.2) or bp (e.g. 100bp) (default = 0.12)
 -e = <flag> greedily extend LCBs? experimental! (default = NO)
 -n = <string>: alignment program (default: libMUSCLE)

<<SNP filtration>>
 -R = <flag>: disable (R)epeat filtering?
 -x = <flag>: enable recombination filtering? (default: NO)

<<Misc>>
 -h = <flag>: (h)elp: print this message
 -p = <int>: number of threads to use? (default= 1)
 -P = <int>: max partition size? limits memory usage (default= 15000000)
 -v = <flag>: (v)erbose output? (default = NO)

parsnp tree node labels

Hi, I ran parsnp successfully and have a Newick tree. Are the node labels bootstrap values? If not, what do the node label values represent? Is it possible to generate a tree with bootstrap values (or extract that information)?

parsnp for bins from metagenomic assembly

Hello,

I have a couple of hundred bins corresponding to strains of a bacterial species. Is it sensible to use parsnp to infer sites of recombination in core genes based on this type of data?

Thank you

OSX compiled version

Hi,
parsnp and the Harvest suite look very promising, but I'm having trouble locating the precompiled OSX version of parsnp - https://github.com/marbl/parsnp/releases/download/v1.0/parsnp-OSX64.gz seems to result in a 404 error. Could you point me to the correct download link please? I'm eager to try them out!
Thanks very much.

Sequence headers in .xmfa

Does anyone know how to interpret the sequence headers in the .xmfa file?

For example, take the following:
>28:462049-462515 - cluster8 s134:p5601

I understand that "28" indicates the genome, "462049-462515" the coordinates, "-" the strand, and "cluster8" the LCB. But what about "s134:p5601"? What does this indicate?

I already read about the format here and couldn't find the answer:
http://darlinglab.org/mauve/user-guide/files.html

Thanks!
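
For anyone scripting against these headers, the fields named above can be split out with a regular expression. This is only a sketch: `XmfaHeader` and `parse_header` are hypothetical names, and the trailing token (here "s134:p5601") is kept verbatim in `extra` rather than interpreted, since its meaning is exactly what this question asks.

```python
import re
from typing import NamedTuple

class XmfaHeader(NamedTuple):
    genome: int
    start: int
    end: int
    strand: str
    cluster: str
    extra: str  # trailing token such as "s134:p5601", kept unparsed

# >genome:start-end strand cluster [anything else]
HEADER_RE = re.compile(r"^>(\d+):(\d+)-(\d+)\s+([+-])\s+(\S+)(?:\s+(.*))?$")

def parse_header(line):
    """Split one XMFA entry header into its whitespace-separated fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an XMFA entry header: {line!r}")
    genome, start, end, strand, cluster, extra = match.groups()
    return XmfaHeader(int(genome), int(start), int(end), strand, cluster, extra or "")

header = parse_header(">28:462049-462515 - cluster8 s134:p5601")
print(header.genome, header.start, header.end, header.strand, header.cluster, header.extra)
```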

Mapping coordinates of the MAF back to the reference

Does anyone know whether someone has already worked out how to map coordinates from the multi-alignment file output back to the original coordinates in the reference GenBank file?

If it hasn't been done, I may have a try at it.
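
As a starting point, the core of such a mapping is walking the reference row of one alignment block and skipping gap columns. The sketch below assumes the reference row is on the forward strand and that the block header gives a 1-based start coordinate for its first base; both are assumptions to verify against your own output. `column_to_ref_positions` is a hypothetical helper, and offsets for references made of several GenBank records are not handled.

```python
def column_to_ref_positions(aligned_ref, start):
    """Map each alignment column to a reference coordinate (None at gaps).

    aligned_ref: the reference row of one alignment block, gaps written as '-'.
    start: assumed 1-based coordinate of the first reference base in the block.
    """
    positions = []
    coordinate = start
    for base in aligned_ref:
        if base == "-":
            positions.append(None)   # gap column: no reference base here
        else:
            positions.append(coordinate)
            coordinate += 1
    return positions

# Toy block: the reference row has a two-column gap after its second base.
print(column_to_ref_positions("AC--GT", 100))
```

A column index in the block then translates directly to a reference position, which can be compared against the feature locations in the GenBank file.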

Failure to install from source

Any help would be appreciated. Here's the error I got:

/usr/bin/perl: symbol lookup error: /usr/common/usg/languages/perl/5.24.0/extra/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_handshake
Now run ./configure --prefix=/global/homes/s/snayfach && make install
Add --disable-shared to the configure line if building on Mac OS X
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking for style of include used by make... GNU
checking dependency style of g++... gcc3
checking whether ln -s works... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking how to print strings... printf
checking for gcc... gcc
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking dependency style of gcc... gcc3
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by gcc... /global/common/genepool_sl72/usg/languages/gcc/7.1.0/x86_64-pc-linux-gnu/bin/ld
checking if the linker (/global/common/genepool_sl72/usg/languages/gcc/7.1.0/x86_64-pc-linux-gnu/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/common/usg/languages/gcc/7.1.0/bin/nm -B
checking the name lister (/usr/common/usg/languages/gcc/7.1.0/bin/nm -B) interface... BSD nm
checking the maximum length of command line arguments... 1572864
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking how to convert x86_64-unknown-linux-gnu file names to x86_64-unknown-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-unknown-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /global/common/genepool_sl72/usg/languages/gcc/7.1.0/x86_64-pc-linux-gnu/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for ar... ar
checking for archiver @FILE support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/common/usg/languages/gcc/7.1.0/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for mt... mt
checking if mt is a manifest tool... no
checking how to run the C preprocessor... gcc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... yes
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/global/common/genepool_sl72/usg/languages/gcc/7.1.0/x86_64-pc-linux-gnu/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /global/common/genepool_sl72/usg/languages/gcc/7.1.0/x86_64-pc-linux-gnu/bin/ld -m elf_x86_64
checking if the linker (/global/common/genepool_sl72/usg/languages/gcc/7.1.0/x86_64-pc-linux-gnu/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the g++ linker (/global/common/genepool_sl72/usg/languages/gcc/7.1.0/x86_64-pc-linux-gnu/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for g++ option to produce PIC... -fPIC -DPIC
checking if g++ PIC flag -fPIC -DPIC works... yes
checking if g++ static flag -static works... yes
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/global/common/genepool_sl72/usg/languages/gcc/7.1.0/x86_64-pc-linux-gnu/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... (cached) GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking for ANSI C header files... (cached) yes
checking for an ANSI C-conforming const... yes
checking for inline... inline
checking whether time.h and sys/time.h may both be included... yes
checking whether gcc needs -traditional... no
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating libMUSCLE/Makefile
config.status: creating libMUSCLE-3.7.pc
config.status: executing depfiles commands
config.status: executing libtool commands
Making install in libMUSCLE
make[1]: Entering directory '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/libMUSCLE'
make[2]: Entering directory '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/libMUSCLE'
 /usr/bin/mkdir -p '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib'
 /bin/sh ../libtool   --mode=install /usr/bin/install -c   libMUSCLE-3.7.la '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib'
libtool: install: /usr/bin/install -c .libs/libMUSCLE-3.7.so.1.0.0 /global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib/libMUSCLE-3.7.so.1.0.0
libtool: install: (cd /global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib && { ln -s -f libMUSCLE-3.7.so.1.0.0 libMUSCLE-3.7.so.1 || { rm -f libMUSCLE-3.7.so.1 && ln -s libMUSCLE-3.7.so.1.0.0 libMUSCLE-3.7.so.1; }; })
libtool: install: (cd /global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib && { ln -s -f libMUSCLE-3.7.so.1.0.0 libMUSCLE-3.7.so || { rm -f libMUSCLE-3.7.so && ln -s libMUSCLE-3.7.so.1.0.0 libMUSCLE-3.7.so; }; })
libtool: install: /usr/bin/install -c .libs/libMUSCLE-3.7.lai /global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib/libMUSCLE-3.7.la
libtool: install: /usr/bin/install -c .libs/libMUSCLE-3.7.a /global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib/libMUSCLE-3.7.a
libtool: install: chmod 644 /global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib/libMUSCLE-3.7.a
libtool: install: ranlib /global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib/libMUSCLE-3.7.a
libtool: finish: PATH="/usr/common/usg/hpc/OFED/gnu7.1/4.0-2.0.0.1-Mellanox/bin:/usr/common/usg/languages/perl/5.24.0/extra/bin:/usr/common/usg/languages/perl/5.24.0/bin:/usr/common/usg/utility_libraries/libgd/gnu7.1/2.2.4/bin:/usr/common/usg/utility_libraries/libfreetype/gnu7.1/2.7.1/bin:/usr/common/usg/utility_libraries/libjpeg/gnu7.1/6b/bin:/usr/common/usg/utility_libraries/libpng/gnu7.1/1.6.28/bin:/usr/common/usg/languages/python/2.7-anaconda/bin:/usr/common/usg/utilities/mysql/5.7.19/bin:/usr/common/jgi/oracle_client/11.2.0.3.0/client_1/bin:/usr/common/usg/languages/java/jdk/oracle/1.8.0_31_x86_64/bin:/usr/common/usg/languages/gcc/7.1.0/bin:/usr/common/usg/bin:/usr/common/mss/bin:/usr/common/nsg/bin:/usr/syscom/nsg/bin:/usr/share/Modules/3.2.10/bin:/global/homes/s/snayfach/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/opt/ibutils/bin:/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2:/sbin" ldconfig -n /global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib
----------------------------------------------------------------------
Libraries have been installed in:
   /global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
 /usr/bin/mkdir -p '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/bin'
  /bin/sh ../libtool   --mode=install /usr/bin/install -c muscle '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/bin'
libtool: install: /usr/bin/install -c .libs/muscle /global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/bin/muscle
 /usr/bin/mkdir -p '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/include/libMUSCLE-3.7/libMUSCLE'
 /usr/bin/install -c -m 644 alpha.h cluster.h clust.h clustsetdf.h clustset.h clustsetmsa.h diaglist.h distcalc.h distfunc.h dpregionlist.h dpreglist.h edgelist.h enumopts.h enums.h estring.h gapscoredimer.h gonnet.h intmath.h msadist.h msa.h muscle.h objscore.h params.h profile.h pwpath.h refine.h scorehistory.h seq.h seqvect.h textfile.h timing.h tree.h types.h unixio.h threadstorage.h '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/include/libMUSCLE-3.7/libMUSCLE'
make[2]: Leaving directory '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/libMUSCLE'
make[1]: Leaving directory '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/libMUSCLE'
make[1]: Entering directory '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle'
make[2]: Entering directory '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle'
make[2]: Nothing to be done for 'install-exec-am'.
 /usr/bin/mkdir -p '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib/pkgconfig'
 /usr/bin/install -c -m 644 libMUSCLE-3.7.pc '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle/lib/pkgconfig'
make[2]: Leaving directory '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle'
make[1]: Leaving directory '/global/projectb/scratch/snayfach/projects/gut_mags/18_snp_trees/parsnp-1.2/muscle'
/usr/bin/perl: symbol lookup error: /usr/common/usg/languages/perl/5.24.0/extra/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_handshake
/usr/bin/perl: symbol lookup error: /usr/common/usg/languages/perl/5.24.0/extra/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_handshake
/usr/bin/perl: symbol lookup error: /usr/common/usg/languages/perl/5.24.0/extra/lib/perl5/x86_64-linux-thread-multi/auto/List/Util/Util.so: undefined symbol: Perl_xs_handshake
build_parsnp_linux.sh: line 7: ./configure: No such file or directory
make: *** No rule to make target 'install'.  Stop.

Parsnp error

Hello, I recently installed the Parsnp tool on my Ubuntu 18.04 LTS and when I tried to run the tool I ended up with the following error.

-->Reading Genome (asm, fasta) files from /home/preventiva/Desktop/Teste..
|->[OK]
-->Reading Genbank file(s) for reference (.gbk) ..
|->[WARNING]: no genbank file provided for reference annotations, skipping..
-->Calculating MUMi..
|->[OK]
-->Running Parsnp multi-MUM search and libMUSCLE aligner..
|->[OK]
-->Running PhiPack on LCBs to detect recombination..
|->[SKIP]
-->Reconstructing core genome phylogeny..
ERROR
The following command failed:

fasttree -nt -quote -gamma -slow -boot 100 /home/preventiva/P_2018_06_14_205550206988/parsnp.snps.mblocks > /home/preventiva/P_2018_06_14_205550206988/parsnp.tree
Please veryify input data and restart Parsnp. If the problem persists please contact the Parsnp development team.
ERROR
Am I doing something wrong? I've already checked the fasttree software and it is correctly installed.

Thank you for your help.
Gustavo Sambrano

No VCF file produced

Hi,

thanks a lot for this really interesting and useful software.
The "Quick start" page of the documentation (http://harvest.readthedocs.org/en/latest/content/parsnp/quickstart.html) says a VCF file should be present in the output directory, but when I run the following command no such file is produced.

./parsnp -g NC_000913.gbk -r genomes/NC_000913.fasta -d genomes -p 10 -v -c

This is the content of the output directory:

all_mumi.ini         parsnpAligner.ini  parsnp.ggr   parsnp.xmfa
NC_000913.fasta.ref  parsnpAligner.log  parsnp.tree  psnn.ini

I'm interested in getting the SNPs of the input genomes as compared to the reference; if that information is stored elsewhere I would be happy anyway :)

Thanks a lot,
Marco

Order of arguments seems to matter

Hi,

It seems that parsnp requires its arguments to be in a specific order. For example, if I use the -P argument before -o, it writes the output to a P_* directory instead of the directory specified by -o. It runs fine if I put the -P argument at the end.

-Mitchell

MUM distance matrix?

Is it possible to create a matrix of all pairwise MUM distances between a set of genomes with parsnp? I tried inserting the -M option in my usual parsnp command: "parsnp -g -d -p 8 -x -c -M -o ", but got this error:
Traceback (most recent call last):
  File "<string>", line 876, in <module>
AttributeError: 'str' object has no attribute 'close'

Verify input data

Hi!

I'm running parsnp on mac 10.9.1 and I'm getting the error 'Please veryify input data and restart Parsnp. If the problem persists please contact the Parsnp development team.' I can't identify any issues with my input files?

Thanks in advance for any advice!

Here is the full command and stdout:

CIMIs-Mac-Pro:harvest-OSX64-v1.0.1 CIMI$ ./parsnp -r ~/Desktop/temp_forB58/LESB58.fasta -d ~/Desktop/Wider_PA_LES/ -p 8 -o ./test
|--Parsnp v1.0--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest

-->Reading Genome (asm, fasta) files from /Users/CIMI/Desktop/Wider_PA_LES/..
|->[OK]
-->Reading Genbank file(s) for reference (.gbk) ..
|->[WARNING]: no genbank file provided for reference annotations, skipping..
-->Calculating MUMi..
|->[OK]
-->Running Parsnp multi-MUM search and libMUSCLE aligner..
ERROR
The following command failed:

/var/folders/23/zjs7hbts6mnbmslv8fdnzzx40000gn/T/_MEIrg7Jej/parsnp ./test/parsnpAligner.ini
Please veryify input data and restart Parsnp. If the problem persists please contact the Parsnp development team.
ERROR

issue with gbk file

Hello, I'm trying to assess parsnp as a tool to work with Listeria genomes, but I am facing this issue

|--Parsnp v1.2--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
ERROR: Genbank file ./media/ANALYSE/jmariet/tools/parsnp/Parsnp-Linux64-v1.2/data/tuto/ref/Lm_CC9_SLCC2479.gbk not found

I tried the answers proposed in earlier issues, but none of them worked.

Please could you help me to figure out what's going wrong?

regards

Random reference gets ".srt" added to its label

Two files, "1.fa" and "2.fa", in dir 'fasta'.

Run "parsnp -r '!' -d fasta -o out"

Look at "out/parsnp.tree": ('2.fa.srt':0.0,'1.fa':0.0);

I assume the .srt suffix on the label refers to the one that was chosen as the reference?

Should it be there?

results: vcf file missing?

Hi!

I just downloaded parsnp (for linux), and ran the mers example. However, I do not get a vcf file as an output file:

karinlag@bl8vbox[parsnp_test] ../../bin/Parsnp-Linux64-v1.2/parsnp -g ./ref/EMC_2012.gbk -d ./mers49 -c [10:52AM]
|--Parsnp v1.2--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest

-->Reading Genome (asm, fasta) files from ./mers49..
|->[OK]
-->Reading Genbank file(s) for reference (.gbk) ./ref/EMC_2012.gbk..
|->[OK]
-->Running Parsnp multi-MUM search and libMUSCLE aligner..
|->[OK]
-->Running PhiPack on LCBs to detect recombination..
|->[SKIP]
-->Reconstructing core genome phylogeny..
|->[OK]
-->Creating Gingr input file..
|->[OK]
-->Calculating wall clock time..
|->Aligned 49 genomes in 0.82 seconds

<<Parsnp finished! All output available in /home/karinlag/tmp/parsnp_test/P_2017_02_08_105223126989>>

Validating output directory contents...
1)parsnp.tree: newick format tree [OK]
2)parsnp.ggr: harvest input file for gingr (GUI) [OK]
3)parsnp.xmfa: XMFA formatted multi-alignment [OK]

karinlag@bl8vbox[parsnp_test]

Is there something I should be doing that I have forgotten?

Karin

parsnp.tree format vs newick format

Hi-

The parsnp.tree file I create has a scale bar of 0.05. But when I convert that to a newick formatted file with harvesttools, the scale becomes 5.0 E-5. I'm viewing both files with FigTree. Assuming that the scale is substitutions/site, can someone tell me why it is changing?

Thanks!

Glen

Genome file extensions not checked

When genomes are *.fna, parsnp indicates the genome files are '|->[OK]', however this does not check for the file extension expected (only *.fasta ?). Parsnp dies without explaining the file ext(s) are unrecognized as FastA format, instead printing 'IndexError: string index out of range'. A lucid error msg here would be helpful, perhaps indicating which extensions are accepted.

This might fix Issue #12.

Example:
$ parsnp -r ~/NC_013315.fna -g ~/NC_013315.gbk -c -u -v -d ~/PARSNP/027 -p 30 -o ~/PARSNP/027_out | tee -a ~/PARSNP/027/parsnp.log

|--Parsnp v1.2--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
-->Reading Genome (asm, fasta) files from /home/PARSNP/027..
|->[OK]
-->Reading Genbank file(s) for reference (.gbk) /home/NC_013315.gbk..
|->[OK]
Traceback (most recent call last):
  File "<string>", line 656, in <module>
.******************************************************************************
SETTINGS:
|-refgenome: /home/NC_013315.fna
|-aligner: libMUSCLE
|-seqdir: /home/PARSNP/027
|-outdir: /home/PARSNP/027_out
|-OS: Linux
|-threads: 30
.******************************************************************************

IndexError: string index out of range
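
Until there is such a message, a quick look at which extensions are present in the sequence directory can help rule this out. The sketch below is not taken from parsnp. The set of "accepted" extensions is a hypothetical placeholder, since the thread does not confirm which ones parsnp really recognizes.

```python
import os
from collections import defaultdict

# Hypothetical whitelist: NOT confirmed against the parsnp source.
ASSUMED_EXTENSIONS = frozenset({".fasta", ".fna", ".fa"})


def group_by_extension(filenames):
    """Group file names by lowercase extension ('' when there is none)."""
    groups = defaultdict(list)
    for name in filenames:
        extension = os.path.splitext(name)[1].lower()
        groups[extension].append(name)
    return dict(groups)


def unexpected_files(filenames, accepted=ASSUMED_EXTENSIONS):
    """Return, sorted, the names whose extension is not in the accepted set."""
    return sorted(
        name for name in filenames
        if os.path.splitext(name)[1].lower() not in accepted
    )
```

For example, unexpected_files(os.listdir(seq_dir)), where seq_dir is the directory given to -d, lists anything that would be worth renaming or moving before a retry.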

parsnp.vcf versus parsnp.snps

According to the docs, parsnp.vcf is used to infer the phylogenetic tree. If so, then why does it contain the snps I want to filter out near recombination sites (-x option)? If REC and LCB snps are present in parsnp.vcf, how is parsnp.snps created?

Cannot run parsnp with A. thaliana reference.

Hi, I am trying to utilize parsnp to detect SNPs in three A. thaliana genome assemblies.

I put three A. thaliana assemblies in the ./Genomes folder.
I expect to get results in ./Result folder.

parsnp -r ./TAIR10.fa -d ./Genomes -o ./Result

However, I get the following error. How could I proceed?

ERROR: ref genome sequence ./TAIR10.fa seems to aligned! remove and restart

IndexError: string index out of range

I noticed that this was an issue previously, but there it was the result of too many arguments (-r and -g). I am getting the same error, but I'm only using -g. I've tried running this on Linux and on High Sierra, and both give me the error message. I'm not sure what I'm doing wrong. Here I am using the tutorial data and getting the error message.

Thanks,
Derreck

~/Parsnp/Parsnp-Linux64-v1.2$ ./parsnp -g ../ref/EMC_2012.gbk -d ../mers49/ -c
|--Parsnp v1.2--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest

-->Reading Genome (asm, fasta) files from ../mers49/..
|->[OK]
-->Reading Genbank file(s) for reference (.gbk) ../ref/EMC_2012.gbk..
|->[OK]
Traceback (most recent call last):
File "", line 656, in
IndexError: string index out of range

Please add --version flag

Ideally a simple --version flag to help make it easy to bundle into Galaxy and Homebrew.

% parsnp --version
parsnp 1.1

Freebayes implementation missing?

The manuscript mentions, "Base quality is optionally determined using FreeBayes [54] to measure read support and mixed alleles."

I don't see that option in the v1.2 command line options. Also, I can't find a FreeBayes "wrapper" in the python script. In contrast, there are "run_phipack" and "run_fasttree" routines defined.

-c not recruiting all genomes

Parsnp seems to be leaving out about a quarter of my assemblies even with the -c flag. We are running assemblies from samples all of the same species, though of differing assembly qualities. Are there selection criteria for assemblies even with -c?

parsnp -g ../../NC_XXXXXX.gb -d LargeContigs_Mike/ -p 12 -o P_metag_isolates_150511/ -c

IndexError

Hi!

I'm getting the following error when executing parsnp on my directory of assemblies. Glancing at the code, it seems like it might have something to do with the header in my assemblies. Any ideas?

Header example:

contig_27 Flow: 1 Edge ( 1504526, 2275572) String Length: 11237 Coverage: 0

Error:

~/bin/Parsnp-Linux64-v1.2/parsnp -r ../../3050.fasta -d ../LargeContigs/ -v YES -p 12
|--Parsnp v1.2--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest

SETTINGS:
|-refgenome: ../../2013T3050-joined.fasta
|-aligner: libMUSCLE
|-seqdir: ../LargeContigs_Mike/
|-outdir: /LargeContigs_Mike/P_2015_05_05_171008977035
|-OS: Linux
|-threads: 32

<<Parsnp started>>

-->Reading Genome (asm, fasta) files from ../LargeContigs/..
|->[OK]
-->Reading Genbank file(s) for reference (.gbk) ..
|->[WARNING]: no genbank file provided for reference annotations, skipping..
Traceback (most recent call last):
File "", line 656, in
IndexError: string index out of range

parsnp maf and vcf files

Has anyone found a way to link the position of a SNP in the VCF file with the MAF alignment?

parsnp default options

How can I determine which FastTree, MUSCLE and PhiPack options were used when I run parsnp? I'm particularly interested in knowing which model of nucleotide evolution is used and what the unit of scale is in the output tree. I didn't see anything useful in the parsnpAligner.log or .ini files.

Thanks!

Glen

Parsnp requires 2 or more genomes to run, exiting

Hello,

I am using a FASTA file as a reference and 2 de novo assembled contig files as my queries. I know the query files contain the reference with some SNPs. This is the command I am using:

parsnp -q ./GCF_001077715.1_ASM107771v1_genomic.fna -d ./concat_seq/ -o test/ -p 12

My error log looks like this:
-->Reading Genome (asm, fasta) files from ./concat_seq/..
|->[OK]
-->Reading Genbank file(s) for reference (.gbk) ..
|->[WARNING]: no genbank file provided for reference annotations, skipping..
Parsnp requires 2 or more genomes to run, exiting
[]
I am not sure where the problem lies and any input on this will be of great help!

Thank you!

parsnp cannot find input sequences

Hello,

I was trying to do a simple test run with the following code:
parsnp -p 2 -d test
and got the following:
ERROR: no seqs provided, yet required. exit!

Does anyone know what went wrong?
Thanks!
Brooke

Symbol lookup error

I have an issue when I try to run parsnp with the tutorial data on Debian

|--Parsnp v1.2--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
******************************************************************************
SETTINGS:
|-refgenome:    ./ref/EMC_2012.gbk.fna
|-aligner:  libMUSCLE
|-seqdir:   ./mers49
|-outdir:   /home/mgonnet/Bureau/sandbox_parsnp/P_2016_05_09_110045190618
|-OS:       Linux
|-threads:  32
******************************************************************************

<<Parsnp started>>

-->Reading Genome (asm, fasta) files from ./mers49..
  |->[OK]
-->Reading Genbank file(s) for reference (.gbk) ./ref/EMC_2012.gbk..
  |->[OK]
RECRUITED GENOMES:


-->Running Parsnp multi-MUM search and libMUSCLE aligner..
/bin/bash: symbol lookup error: /lib/x86_64-linux-gnu/libncurses.so.5: undefined symbol: _nc_putchar
**ERROR**
The following command failed:
>>/tmp/_MEI2wP7ec/parsnp /home/mgonnet/Bureau/sandbox_parsnp/P_2016_05_09_110045190618/parsnpAligner.ini
Please veryify input data and restart Parsnp. If the problem persists please contact the Parsnp development team.
**ERROR**

Any idea how to solve this problem?

Best regards

Failed to run parsnp: "Please veryify input data and restart Parsnp"

Hi,

I am trying to run parsnp without success. This is the error message:

$ parsnp -g my_genome.gb -d assembled_fasta -c -o parsnp_out -p 4
|--Parsnp v1.2--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
***************************
SETTINGS:
|-refgenome:	my_genome.gb.fna
|-aligner:	libMUSCLE
|-seqdir:	assembled_fasta
|-outdir:	parsnp_out
|-OS:		Darwin
|-threads:	4
***************************

<<Parsnp started>>

-->Reading Genome (asm, fasta) files from assembled_fasta..
  |->[OK]
-->Reading Genbank file(s) for reference (.gbk) my_genome.gb..
  |->[OK]
-->Running Parsnp multi-MUM search and libMUSCLE aligner..
**ERROR**
The following command failed:
>>/var/folders/12/_2dy5brs1jddqsl30v83dw140000gp/T/_MEIVmCQ0h/parsnp parsnp_out/parsnpAligner.ini
Please veryify input data and restart Parsnp. If the problem persists please contact the Parsnp development team.
**ERROR**

I have tried parsnp on Bio-Linux and Mac OS X (El Capitan) with essentially the same error message. Is there something wrong with my input? The inputs are draft assemblies in multi-FASTA format (one file per genome).

Below is the content of parsnpAligner.ini

;Parsnp configuration File
;
[Reference]
file=EC958_HG941718.gb.fna
reverse=0
[Query]
file1=assembled_fasta/ESC_BA9493AA_AS.fasta
reverse1=0
file2=assembled_fasta/ESC_CA1422AA_AS.fasta
reverse2=0
...
file3389=assembled_fasta/ESC_NA8384AA_AS.fasta
reverse3389=0
file3390=assembled_fasta/ESC_AA7744AA_AS.fasta
reverse3390=0
[MUM]
anchors=1.0*(Log(S))
anchorfile=
anchorsonly=0
calcmumi=0
mums=1.1*(Log(S))
mumfile=
filter=1
factor=2.0
extendmums=0
[LCB]
recombfilter=0
cores=4
diagdiff=0.12
doalign=2
c=21
d=300
q=30
p=15000000
icr=0
unaligned=0
[Output]
outdir=parsnp_out
prefix=parsnp
showbps=1
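
Since the aligner is driven entirely by this file, one sanity check is to read the [Query] section back and confirm that every listed path exists. The sketch below uses Python's standard configparser. The function name is made up for illustration, and it assumes the real file follows the layout pasted above (fileN / reverseN pairs under [Query]).

```python
import configparser
import re


def query_files(ini_text):
    """Return the fileN paths from the [Query] section, ordered by N."""
    # interpolation=None keeps values such as "1.0*(Log(S))" untouched;
    # allow_no_value=True tolerates bare lines like the "..." in the excerpt.
    parser = configparser.ConfigParser(interpolation=None, allow_no_value=True)
    parser.read_string(ini_text)
    entries = []
    for key, value in parser.items("Query"):
        match = re.fullmatch(r"file(\d+)", key)
        if match and value:
            entries.append((int(match.group(1)), value))
    return [path for _, path in sorted(entries)]
```

Passing the contents of parsnpAligner.ini to query_files and testing each result with os.path.exists, from the directory where parsnp was launched, would show whether any path is unresolvable before the aligner starts.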

Best regards,

annotating vcf output

Is there a way to annotate the vcf output from harvesttools such that the info column will be populated with the ORF locus tag that corresponds to the location of each SNP in the reference genbank file?

setup.py is referring to a non-existent README file

Hi,

when trying to package parsnp for Debian I noticed that setup.py refers to README, which does not exist. If you use README.md there, everything is fine.

Kind regards

      Andreas.

parsnp -h should exit successfully

It exits with exit status 2. It should exit with 0.

❯❯❯ parsnp_OSX64_v1.0.1/parsnp -h >/dev/null; echo $?
|--Parsnp v1.0--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
2
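
For packaging scripts that need to assert this behaviour, a small wrapper around the standard subprocess module is enough. This is a sketch only. The function name is made up, and nothing here is part of parsnp.

```python
import subprocess
import sys


def exit_status(command):
    """Run a command given as a list of arguments and return its exit status."""
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,  # discard the banner and help text
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python exit_status.py parsnp -h
    print(exit_status(sys.argv[1:]))
```

With the behaviour reported above, exit_status(["parsnp", "-h"]) would return 2. A packaging test would want it to return 0.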

Hyphens in reference's deflines are not permitted?

parsnp -v -c -o ../parsnp -r 16-901.fna -d $PWD -p 32 halts on "ERROR: ref genome sequence 16-901.fna seems to aligned! remove and restart"

All deflines for all genomes are named the same with "16-" being the prefix. Curiously, if I remove the hyphens only in the reference's deflines and leave hyphens in the deflines of all other genomes, the error message goes away and I get a parsnp.ggr.

There's no documentation on illegal characters in a reference's deflines for this that I'm aware of, but perhaps this could be fixed to only test for gaps in the sequences not deflines in the reference file.
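
To see which kind of hyphen a given file contains, the two cases can be separated with a few lines of Python. This is an illustrative sketch, not parsnp's own check, and the function name is made up.

```python
def find_hyphens(fasta_lines):
    """Return two lists of 1-based line numbers: deflines containing '-'
    and sequence lines containing '-'."""
    in_deflines, in_sequence = [], []
    for number, line in enumerate(fasta_lines, start=1):
        if "-" not in line:
            continue
        if line.startswith(">"):
            in_deflines.append(number)  # hyphen is only part of a name
        else:
            in_sequence.append(number)  # hyphen is a gap character
    return in_deflines, in_sequence
```

Opening 16-901.fna and passing the file handle to find_hyphens would be expected to report hyphens in deflines only, which is the case the suggested fix would let through.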

A strange error message: Parsnp requires 2 or more genomes to run, exiting

Hello,

I am still figuring out how to run parsnp with my datasets. I have two questions.
(1) Can parsnp handle larger genomes, such as those of A. thaliana or even H. sapiens, rather than only those of microorganisms? If parsnp cannot deal with such large genomes, knowing that would save me the time and effort of running it on my datasets.
(2) In my setup, the Genomes folder contains 3 genome assemblies of A. thaliana, and one chromosome of the A. thaliana genome is passed as the '-r' parameter. The command line is as follows:

parsnp -r ./TAIR10_chr1.fas -d ./Genomes -o ./Result_chr1 -p 64

Although the Genomes folder contains 3 genome assemblies, parsnp does not recognize them.
Their file extension was .fa, and I tried changing it to .fna. Neither extension works.
Could you explain why it does not work?

Euncheon

Parsnp ignores three genomes

Hi,

My reference is a chromosome and the genome folder contains 8 assemblies. But when I run parsnp, it ignores three of the assemblies.
I checked the format of the FASTA files and they are correct. All assemblies are within ±30% of the reference.
None of them contains a zero-length sequence.
I want to know why this happens.

Thanks!
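
The ±30% figure can be double-checked independently. The sketch below assumes it refers to total sequence length, which is an assumption here. The 0.30 default is simply the poster's number, not a confirmed parsnp threshold.

```python
def fasta_total_length(lines):
    """Sum the lengths of all sequence lines (everything except '>' deflines)."""
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))


def within_tolerance(ref_length, query_length, tolerance=0.30):
    """True when query_length deviates from ref_length by at most
    tolerance * ref_length."""
    return abs(query_length - ref_length) <= tolerance * ref_length
```

Running fasta_total_length on the reference and on each of the 8 assemblies, then within_tolerance on each pair, would confirm whether the three ignored assemblies really fall inside the window.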

Static binary URLs fail, error 404

The static binaries on the README.md fail, but the ones on the Releases tab work.

GOOD:
https://github.com/marbl/parsnp/releases/download/v1.0/parsnp-OSX64-v1.0.tar.gz
BAD:
https://github.com/marbl/parsnp/releases/download/v1.0/parsnp-OSX64.gz

IndexError: string index out of range

Hello,
I'm trying to run parsnp, but for some reason it throws an exception, as you can see below.
I looked at the parsnp source file but didn't find any relevant clue to the cause of this problem.
I'd appreciate it if you could assist me here.
Many thanks, Oren
P.S. I usually use parsnp to align sequences, of course, but this time I only need the phylogenetic tree it outputs. So if there's a way to ask it to compute only the tree, that would be fantastic.

$ parsnp -r data/NC_000913.fna -g data/NC_000913.gbk -p 8 -d data/ -o parsnp_ref_000913/ -v
|--Parsnp v1.0--|
For detailed documentation please see --> http://harvest.readthedocs.org/en/latest

-->Reading Genome (asm, fasta) files from data/..
|->[OK]
-->Reading Genbank file(s) for reference (.gbk) data/NC_000913.gbk..
|->[OK]
Traceback (most recent call last):
File "", line 905, in
IndexError: string index out of range

Feature request: allow Prokka formatted GenBank reference (no GI numbers)

The documentation explains that GI numbers are essential in the GenBank reference file passed with the '-g' argument: http://harvest.readthedocs.io/en/latest/content/parsnp/quickstart.html#advanced-usage Specifying a Prokka-generated GenBank file gives this error message:

-->Reading Genome (asm, fasta) files from genomes..
  |->[OK]
-->Reading Genbank file(s) for reference (.gbk) ref.gbk..
  |->[OK]
Parsnp requires 2 or more genomes to run, exiting

and the ref.gbk.fna that parsnp generates automatically contains only one sequence record extracted from the Prokka GenBank file, with the defline being >gi| whitespace protein-seq

GI numbers are phased out by NCBI (https://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/). It would be a useful feature to allow Prokka formatted GenBank files (without GIs) as reference to enable all new genomes, where GIs will no longer be added, to serve as reference.
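
To tell in advance whether a given GenBank file is affected, the records can be scanned for a GI. The sketch below assumes the GI appears as a "GI:" field on the VERSION line, as in older NCBI records. Whether parsnp reads it from that line is not confirmed here, and the function name is made up.

```python
def loci_without_gi(genbank_lines):
    """Return the LOCUS names of records whose VERSION line has no 'GI:' field."""
    missing = []
    locus, has_gi = None, False
    for line in genbank_lines:
        if line.startswith("LOCUS"):
            # A new record starts: close the previous one first.
            if locus is not None and not has_gi:
                missing.append(locus)
            fields = line.split()
            locus = fields[1] if len(fields) > 1 else ""
            has_gi = False
        elif line.startswith("VERSION") and "GI:" in line:
            has_gi = True
    if locus is not None and not has_gi:
        missing.append(locus)
    return missing
```

An empty result for ref.gbk would mean every record carries a GI. Any names returned are the records that would need a workaround.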

XMFA file format

After running parsnp with your test MERS data I get the parsnp.xmfa file.
Before each FASTA sequence there is a header with some information, like this:
">1:57-25332 + cluster2 s1:p57"
Here the first number is the sequence ID defined further up in the file,
and what follows the ":" is the sequence start, "-", and the sequence end, then "+" and the cluster name
combined with the cluster number. I was wondering what "s1" and "p57" mean.
It looks like the number in front of the ":" in "s1:p57" is always the same, whilst the number after ":p" is always
the same as the start position of that sequence's LCB.

Will this always be the case?

I ran it with your MERS data like this:
./parsnp -g mers49/refs/EMC_2012.gbk -d /mers49/mers49 -p 32
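
Whether the "p" value always equals the start position can at least be tested empirically across the whole file. The pattern below is inferred only from the single header quoted above, so it is an assumption about the format rather than a specification. The field names "s" and "p" are kept as-is because their meaning is exactly what is being asked.

```python
import re

# Pattern inferred from one example header: ">1:57-25332 + cluster2 s1:p57"
HEADER = re.compile(
    r"^>(?P<seq>\d+):(?P<start>\d+)-(?P<end>\d+)\s+(?P<strand>[+-])\s+"
    r"cluster(?P<cluster>\d+)\s+s(?P<s>\d+):p(?P<p>\d+)\s*$"
)


def parse_header(line):
    """Parse one header line into a dict, or return None if it does not match."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    strand = fields.pop("strand")
    parsed = {name: int(value) for name, value in fields.items()}
    parsed["strand"] = strand
    return parsed


def p_equals_start(line):
    """True when the 'p' field equals the start coordinate of the same header."""
    parsed = parse_header(line)
    return parsed is not None and parsed["p"] == parsed["start"]
```

Applying p_equals_start to every line of parsnp.xmfa that begins with ">" would show whether the observation holds for all headers in a given run, including any on the "-" strand.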

fastaheaders.txt
