katholt / srst2 Goto Github PK

Short Read Sequence Typing for Bacterial Pathogens

License: Other

Python 91.62% Shell 2.02% R 6.36%

srst2's Introduction

SRST2

Short Read Sequence Typing for Bacterial Pathogens

This program is designed to take Illumina sequence data, a MLST database and/or a database of gene sequences (e.g. resistance genes, virulence genes, etc) and report the presence of STs and/or reference genes.

Authors - Michael Inouye, Harriet Dashnow, Bernie Pope, Ryan Wick, Kathryn Holt (University of Melbourne)

How to cite - The peer-reviewed open-access paper is available in Genome Medicine: http://genomemedicine.com/content/6/11/90

Story-behind-the-paper is here

Problems? Please post an issue here in github: https://github.com/katholt/srst2/issues.

To be notifed of updates, join the SRST2 google group at https://groups.google.com/forum/#!forum/srst2.

Current release

Installation

Basic usage - MLST

Basic usage - Resistance genes

All usage options

Input read formats and options

MLST Database format

Gene databases

Output files

Printing consensus sequences

More basic usage examples

Compile results from completed runs

Running lots of jobs and compiling results

Known issues

Generating SRST2-compatible clustered database from raw sequences

Using the VFBD Virulence Factor Database with SRST2

Preformatted databases for specialist typing with SRST2

Plotting output in R

Example - Shigella sonnei public data

Current release - v0.2.0 - July 28, 2016

Dependencies:

python (v2.7.5 or later)
scipy, numpy http://www.scipy.org/install.html
bowtie2 (v2.1.0 or later) http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
SAMtools v0.1.18 https://sourceforge.net/projects/samtools/files/samtools/0.1.18/ (NOTE: later versions can be used, but better results are obtained with v0.1.18, especially at low read depths (<20x))

Updates in current master branch (not yet in a release)

Added new AMR gene database CARD_v3.0.8_SRST2.fasta, a curated version of CARD v3.0.8 with some fixes and additions based on comparisons with the ARG-Annot, ResFinder and NCBI AMR databases. This was done my Margaret Lam for the Kleborate paper, see details of modifications vs CARD v3.0.8 here).
Added to readme some instructions for using SRST2 to serotype Group B Streptococcus (S. agalactiae) using the GBS-SBG database, available here - thanks to Swaine Chen and colleagues for this.

Updates in v0.2.0

Some improvements to allele calling, particularly for Klebsiella MLST locus mdh, kindly contributed by andreyto. Includes rejection of read alignments that are clipped on both ends (likely to be spurious) and minor bug fixes associated with depth calculations.
Updated E. coli serotype database to remove duplicate sequences.
Added mcr-2 colistin resistance gene to ARGannot.r1.fasta resistance gene database.
A --threads option was added, which makes SRST2 call Bowtie and Samtools with their threading options. The resulting speed up is mostly due to the Bowtie mapping step which parallelises very well.
The VFDB_cdhit_to_csv.py script was updated to work with the new VFDB FASTA format.
Versions of Bowtie2 up to 2.2.9 are now supported. Samtools v1.3 can now be used as well, however v0.1.18 is still the recommended version (for reasons discussed below).
Added scripts/qsub_srst2.py to generate SRST2 jobs for the Grid Engine (qsub) scheduling system (http://gridscheduler.sourceforge.net/). Thanks to Ramon Fallon from the University of St Andrews for putting this together. Some of the specifics are set up for his cluster, so modifications may be necessary to make it run properly on a different cluster using Grid Engine.
Various other small bug fixes!

Updates in v0.1.8

/data directory includes files for subtyping of the LEE pathogenicity island of E. coli, as per Ingle et al, 2016, Nature Microbiology. Instructions below
Resistance gene database updates:

Fixed ARGannot.r1.fasta to include proper mcr1 DNA sequence.
Added columns to the ARGannot_clustered80.csv table, to indicate classes of beta-lactamases included in the ARGannot.r1.fasta database according to the NCBI beta-lactamase resource (new location for the Lahey list).

Fixed some issues with handling of missing data (i.e. where there were no hits to MLST and/or no hits to genes) when compiling results into a table via --prev_output. This could result in misalignment of gene columns in previous versions.

Updates in v0.1.7

Use the following environment variables to specify your prefered samtools and bowtie2 executables (thanks to Ben Taylor for this):

SRST2_SAMTOOLS
SRST2_BOWTIE2
SRST2_BOWTIE2_BUILD

Added mcr1, the plasmid-borne colisting resistance gene to the included ARG-Annot-based resistance gene DB (ARGannot.r1.fasta)
Fixed a problem with writing consensus files that occurred when a directory structure was specified using --output (bug introduced in v0.1.6)

Updates in v0.1.6

The original validation of SRST2 (see paper) was performed with bowtie2 version 2.1.0 and samtools v0.1.18.

bowtie2: SRST2 has now been tested on the tutorial example and other test data sets using the latest versions of bowtie2, 2.2.3 and 2.2.4, which gave identical results to those obtained with bowtie2 v2.1.0. Therefore, the SRST2 code will now run if any of these versions of bowtie2 are available: 2.1.0, 2.2.3 or 2.2.4.
samtools: SRST2 has now been tested on the Staph & Salmonella test data sets used in the paper, and will work with newer samtools versions (tested up to v1.1). Note however that SRST2 still works best with samtools v0.1.18, due to small changes in the mapping algorithms in later versions that result in some loss of reads at the ends of alleles. This has most impact at low read depths, however we do recommend using v0.1.18 for optimum results.

Minor fixes to the ARG-Annot database of resistance genes, including removal of duplicate sequences and fixes to gene names (thanks to Wan Yu for this). Old version remains unchanged for backwards compatibility, but we recommend using the revised version (located in data/ARGannot.r1.fasta).
Added EcOH database for serotyping E. coli (thanks to Danielle Ingle for this). See Using the EcOH database for serotyping E. coli with SRST2 and this MGen paper.
Fixed a problem where, when analysing multiple read sets in one SRST2 call against a gene database in which cluster ids don't match gene symbols, individual gene clusters appear multiple times in the output. The compile function was unaffected and remains unchanged.
Fixed behaviour so that including directory paths in --output parameter works (thanks to nyunyun for contributing most of this fix). E.g. --output test_dir/test will create output files prefixed with test, located in test_dir/, and all SRST2 functions should work correctly including consensus allele calling. If test_dir/ doesn't exist, we attempt to create it; if this is not possible the user is alerted and SRST2 stops.
Fixed problem when using a gene database with a simple fasta header (ie not clustered for SRST2; note best results are achieved by pre-clusering your sequence database beforehand) (thanks to cglambert for this one).
Fixes contributed by ppcherng (thanks!):

Fixed KeyErrors that occured when a given seqID was not found in the seq2cluster dictionary, which tended to happen if the FASTA file (gene database) contained empty entries that only have a header and no sequence.
Note v0.1.5 included addition of ppcherng's utility scripts to help automate creation of SRST2-compatible gene databases from VFDB.

Added new parameter --samtools_args to pass additional options to samtools mpileup (e.g. SionBayliss requested this in order to use -A option in samtools mpileup to include anomalous reads).
Fixed problem with consensus sequence reporting of truncated alleles (issue #39).
Added basic instructions for the R scripts provided for plotting output. See Plotting output in R

Updates in v0.1.5

Optionally switch on reporting of pileups and consensus sequences (fasta) for novel alleles (--report_new_consensus) or for all alleles (--report_all_consensus). See Printing consensus sequences
Post-process consensus sequences from a set of strains, to generate one file per locus containing all/new consensus sequences. See Collate consensus sequences
Some enhancements to getmlst.py script to handle some more unusual scheme names (force download of specific schemes that have non-unique names, handle forward slashes in names).
Fixed an issue where, if multiple readsets analysed in serial in a SRST2 run, the fullgenes report would only contain the results for the last readset. Fullgenes report now contains gene output for all readsets.
Added option (--merge_paired) to accommodate cases where users have multiple read sets for the same sample. If this flag is used, SRST2 will assume that all the input reads belong to the same sample, and outputs will be named as [prefix]__combined.xxx, where SRST2 was run using --output [prefix]. If the flag is not used, SRST2 will operate as usual and assume that each read pair is a new sample, with output files named as [prefix]__[sample].xxx, where [sample] is taken from the base name of the reads fastq files. Note that if you have lots of multi-run read sets to analyse, the ease of job submission will depend heavily on how your files are named and you will need to figure out your own approach to manage this (ie there is no way to submit multiple sets of multiple reads).
The original validation of SRST2 (see paper) was performed with bowtie2 version 2.1.0. SRST2 has now been tested on the tutorial example using the latest versions of bowtie2, 2.2.3 and 2.2.4, which gave identical results to those obtained with bowtie2 v2.1.0. Therefore, the SRST2 code will now run if any of these versions of bowtie2 are available: 2.1.0, 2.2.3 or 2.2.4. (Note however that there are still incompatibilities with the recent release of samtools, so you will need to stick to samtools v0.1.18 unless you want to modify the SRST2 code to allow later versions, and are happy with a dramatic loss in accuracy!)
Thanks to ppcherng for adding utility scripts to help automate creation of SRST2-compatible gene databases from VFDB.

Updates in v0.1.4

No longer store sam and unsorted bam (can be retained via the --keep_interim_alignment flag)
Added options to specify a maximum number of mismatches to allow during mapping; this is specified separately for mlst and genes, so that it is possible to relax the stringency of gene detection in the same run as a high-accuracy MLST test.

--mlst_max_mismatch
--gene_max_mismatch
Default value for both is 10 mismatches.

The highest minor allele frequency (MAF) of variants encountered in the alignment is now calculated and reported for each allele (in the scores file) and also at the gene level and ST level, to facilitate checking for mixed/contaminated read sets. This value is in the range 0 -> 0.5; with e.g. 0 indicating no variation between reads at any aligned base (i.e. at all positions in the alignment, all aligned reads agree on the same base call; although this agreed base may be different from the reference); and 0.25 indicating there is at least one position in the alignment at which all reads do not agree, and the least common variant (either match or mismatch to the reference) is present in 25% of reads. This value is printed, for all alleles, to the scores file. Note this is different to the ‘LeastConfident’ information printed to scores, which presents the strongest evidence for mismatch compared to the reference, i.e. between 0 -> 1. The highest such value for each gene/cluster/locus is reported in the fullgenes output table. The highest such value across all MLST loci is reported in the mlst output table. Note that all compiled reports will now include a maxMAF column; if you provide MLST or compiled reports from previous versions without this columns, the value “NC” will be inserted in the maxMAF column to indicate “not calculated”. This ensures the updated SRST2 (v0.1.4+) is backwards compatible with previous SRST2 outputs; do be aware though that the older versions of SRST2 (<v0.1.4) will not be forwards-compatible with output generated by more recent versions (v0.14 onwards).
Added R code for plotting SRST2 output in R (plotSRST2data.R). Instructions will be added to the read me.
Added formatted versions of the ARG-Annot resistance gene database, PlasmidFinder database and 18 plasmid replicon sequences to the /data directory. See /data/README.md for details and citations. It is recommended to use ARGannot.r1.fasta for detection of acquired resistance genes.

Updates in v0.1.3

Fixed a bug that occurred while trying to type genes from a user-supplied database (see issue #5, thanks to Scott Long)
Fixed a bug in gene detection reporting - genes are now correctly reported by cluster, rather than by gene symbol (see issue #7)
Added maximum divergence option for reporting (--max_divergence), default is now to report only hits with <10% divergence from the database (see issue #8)
added parameter to pass to bowtie2 parameter -u N to stop mapping after the first N reads. Default behaviour remains to map all reads. However, for large read sets (e.g. >100x), extra reads do not help and merely increase the time taken for mapping and scoring, and you may want to limit to the first million reads (100x of a 2 Mbp genome) using --stop_after 1000000.

Installation

1 - Install dependencies

python (v2.7.5)
scipy http://www.scipy.org/install.html
bowtie2 v2.1.0 http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
SAMtools v0.1.18 https://sourceforge.net/projects/samtools/files/samtools/0.1.18/ (NOTE 0.1.19 DOES NOT WORK)

N.B. If you have multiple versions of samtools or bowtie2 installed, you can pick which one srst2 or slurm_srst2 should use by setting the following environment variables.

SRST2_SAMTOOLS
SRST2_BOWTIE2
SRST2_BOWTIE2_BUILD

If these aren't set or are missing, they will default to looking in your PATH for samtools, bowtie2 and bowtie2-build. The exception is SRST2_BOWTIE2_BUILD which, if it is not set or missing, will try adding -build to SRST2_BOWTIE2 if it exists, otherwise it defaults to looking in your PATH

2 - Get and install the code

Make sure you have installed git and pip.

Clone the git repository: git clone https://github.com/katholt/srst2

and then install with pip: pip install srst2/

OR do both at once: sudo pip install git+https://github.com/katholt/srst2

3 - Test that the programs are installed properly

srst2 --version
getmlst.py -h
slurm_srst2.py -h

The downloaded directory also contains things that might be useful for SRST2 users:

data/ contains a databases for resistance genes and plasmids
database_clustering/ contains scripts and instructions for formatting of other gene databases for use with SRST2
example.txt contains a tutorial/example on running SRST2 using public data

Basic usage - MLST

1 - Gather your input files

(i) sequence reads (this example uses paired reads in gzipped fastq format, see below for options)

(ii) a fasta sequence database to match to. For MLST, this means a fasta file of all allele sequences. If you want to assign STs, you also need a tab-delim file which defines the ST profiles as a combination of alleles. You can retrieve these files automatically from pubmlst.org/data/ using the script provided:

getmlst.py --species 'Escherichia coli#1'

2 - Run MLST

srst2 --input_pe strainA_1.fastq.gz strainA_2.fastq.gz --output strainA_test --log --mlst_db Escherichia_coli#1.fasta --mlst_definitions profiles_csv --mlst_delimiter _

3 - Check the outputs

(i) MLST results are output in: strainA_test__mlst__Escherichia_coli#1__results.txt

Sample	ST	adk	fumC	gyrB	icd	mdh	purA	recA	mismatches	uncertainty	depth	maxMAF
strainA	152	11	63	7	1	14	7	7	0	-	25.8319955826	0.125

Basic usage - Resistance genes

1 - Gather your input files

(i) sequence reads (this example uses paired reads in gzipped fastq format, see below for options)

(ii) a fasta sequence database to match to. For resistance genes, this means a fasta file of all the resistance genes/alleles that you want to screen for, clustered into gene groups. Some suitable databases are distributed with SRST2 (in the /data directory); we recommend using /data/CARD_v3.0.8_SRST2.fasta for acquired resistance genes (note this latest version has been added since the last SRST2 release so you may need to download it directly from the /data directory in this github repository).

2 - Run gene detection

srst2 --input_pe strainA_1.fastq.gz strainA_2.fastq.gz --output strainA_test --log --gene_db CARD_v3.0.8_SRST2.fasta

3 - Check the outputs

(i) Gene detection results are output in: strainA_test__genes__resistance__results.txt

Sample	aadA	dfrA	sul2	tet(B)
strainA	aadA1-5	dfrA1_1	sul2_2	tet(B)_4

All usage options

srst2 -h

SRST2 - Short Read Sequence Typer (v2)

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --input_se INPUT_SE [INPUT_SE ...]
                        Single end read file(s) for analysing (may be gzipped)
  --input_pe INPUT_PE [INPUT_PE ...]
                        Paired end read files for analysing (may be gzipped)
  --merge_paired        Switch on if all the input read sets belong to a
                        single sample, and you want to merge their data to get
                        a single result
  --forward FORWARD     Designator for forward reads (only used if NOT in
                        MiSeq format sample_S1_L001_R1_001.fastq.gz; otherwise
                        default is _1, i.e. expect forward reads as
                        sample_1.fastq.gz)
  --reverse REVERSE     Designator for reverse reads (only used if NOT in
                        MiSeq format sample_S1_L001_R2_001.fastq.gz; otherwise
                        default is _2, i.e. expect forward reads as
                        sample_2.fastq.gz
  --read_type {q,qseq,f}
                        Read file type (for bowtie2; default is q=fastq; other
                        options: qseq=solexa, f=fasta).
  --mlst_db MLST_DB     Fasta file of MLST alleles (optional)
  --mlst_delimiter MLST_DELIMITER
                        Character(s) separating gene name from allele number
                        in MLST database (default "-", as in arcc-1)
  --mlst_definitions MLST_DEFINITIONS
                        ST definitions for MLST scheme (required if mlst_db
                        supplied and you want to calculate STs)
  --mlst_max_mismatch MLST_MAX_MISMATCH
                        Maximum number of mismatches per read for MLST allele
                        calling (default 10)
  --gene_db GENE_DB [GENE_DB ...]
                        Fasta file/s for gene databases (optional)
  --no_gene_details     Switch OFF verbose reporting of gene typing
  --gene_max_mismatch GENE_MAX_MISMATCH
                        Maximum number of mismatches per read for gene
                        detection and allele calling (default 10)
  --min_coverage MIN_COVERAGE
                        Minimum %coverage cutoff for gene reporting (default
                        90)
  --max_divergence MAX_DIVERGENCE
                        Maximum %divergence cutoff for gene reporting (default
                        10)
  --min_depth MIN_DEPTH
                        Minimum mean depth to flag as dubious allele call
                        (default 5)
  --min_edge_depth MIN_EDGE_DEPTH
                        Minimum edge depth to flag as dubious allele call
                        (default 2)
  --prob_err PROB_ERR   Probability of sequencing error (default 0.01)
  --truncation_score_tolerance TRUNCATION_SCORE_TOLERANCE
                        % increase in score allowed to choose non-truncated
                        allele
  --stop_after STOP_AFTER
                        Stop mapping after this number of reads have been
                        mapped (otherwise map all)
  --other OTHER         Other arguments to pass to bowtie2 (must be escaped,
                        e.g. "\--no-mixed".
  --max_unaligned_overlap MAX_UNALIGNED_OVERLAP
                        Read discarded from alignment if either of its ends
                        has unaligned overlap with the reference that is
                        longer than this value (default 10)
  --mapq MAPQ           Samtools -q parameter (default 1)
  --baseq BASEQ         Samtools -Q parameter (default 20)
  --samtools_args SAMTOOLS_ARGS
                        Other arguments to pass to samtools mpileup (must be
                        escaped, e.g. "\-A").
  --output OUTPUT       Prefix for srst2 output files
  --log                 Switch ON logging to file (otherwise log to stdout)
  --save_scores         Switch ON verbose reporting of all scores
  --report_new_consensus
                        If a matching alleles is not found, report the
                        consensus allele. Note, only SNP differences are
                        considered, not indels.
  --report_all_consensus
                        Report the consensus allele for the most likely
                        allele. Note, only SNP differences are considered, not
                        indels.
  --use_existing_bowtie2_sam
                        Use existing SAM file generated by Bowtie2 if
                        available, otherwise they will be generated
  --use_existing_pileup
                        Use existing pileups if available, otherwise they will
                        be generated
  --use_existing_scores
                        Use existing scores files if available, otherwise they
                        will be generated
  --keep_interim_alignment
                        Keep interim files (sam & unsorted bam), otherwise
                        they will be deleted after sorted bam is created
  --threads THREADS     Use multiple threads in Bowtie and Samtools
  --prev_output PREV_OUTPUT [PREV_OUTPUT ...]
                        SRST2 results files to compile (any new results from
                        this run will also be incorporated)

Input read formats and options

Any number of readsets can be provided using --input_se (for single end reads) and/or --input_pe (for paired end reads). You can provide both types of reads at once. Note however that if you do this, each readset will be typed one by one (in serial). So if it takes 2 minutes to type each read set, and you provide 10 read sets, it will take 20 minutes to get your results. The better way to proces lots of samples quickly is to give each one its own SRST2 job (e.g. submitted simultaneously to your job scheduler or server); then compile the results into a single report using srst2 --prev_output *results.txt --output all. That way each readset's 2 minutes of analysis is occurring in parallel on different nodes, and you'll get your results for all 10 samples in 2 minutes rather than 20.

Read formats

Reads can be in any format readable by bowtie2. The format is passed on to the bowtie2 command via the --read_type flag in SRST2. The default format is fastq (passed to bowtie 2 as q); other options are qseq=solexa, f=fasta. So to use fasta reads, you would need to tell SRST2 this via --read_type f.

Reads may be gzipped.

Read names

SRST2 can parse Illumina MiSeq reads files; we assume that files with names in the format XXX_S1_L001_R1_001.fastq.gz and XXX_S1_L001_R2_001.fastq.gz are the forward and reverse reads from a sample named "XXX". So, you can simply use srst2 --input_pe XXX_S1_L001_R1_001.fastq.gz XXX_S1_L001_R2_001.fastq.gz and SRST2 will recognise these as forward and reverse reads of a sample named "XXX". If you have single rather than paired MiSeq reads, you would use srst2 --input_se XXX_S1_L001_R1_001.fastq.gz.

Paired reads

If you have paired reads that are named in some way other than the Illumina MiSeq format, e.g. from the SRA or ENA public databases, you need to tell SRST2 how to pass these to bowtie2. bowtie2 requires forward and reverse reads to be supplied in separate files, e.g strainA_1.fastq.gz and strainA_2.fastq.gz. SRST2 attempts to sort reads supplied via --input_pe into read pairs, based on the suffix (_1, _2 in this example) that occurs before the file extension (.fastq.gz in this example). So if you supplied --input_pe strainA_1.fastq.gz strainB_1.fastq.gz strainA_2.fastq.gz strainB_2.fastq.gz, SRST2 would sort these into two pairs (strainA_1.fastq.gz, strainA_2.fastq.gz) and (strainB_1.fastq.gz, strainB_2.fastq.gz) and pass each pair on to bowtie2 for mapping. By default, the suffixes are assumed to be _1 for forward reads and _2 for reverse reads, but you can tell SRST2 if you have other conventions, via --forward and --reverse. E.g. if your files were named strainA_read1.fastq.gz and strainA_read2.fastq.gz, you would use these commands: --input_pe strainA_read1.fastq.gz strainA_read2.fastq.gz --forward _read1 --reverse _read2.

Sample names

Sample names are taken from the first part of the read file name (before the suffix if you have paired reads). E.g. strainA_1.fastq.gz is assumed to belong to a sample called "strainA"; strainB_C_1.fastq.gz would be assumed to belong to a sample called "strainB_C". These sample names will be used to name all output files, and will appear in the results files.

If you have multiple read sets for the same sample

The flag --merge_paired tells SRST2 to assume that all the input reads belong to the same sample. Outputs will be named as [prefix]__combined.xxx, where SRST2 was run using --output [prefix]. If this flag is not used, SRST2 will operate as usual and assume that each read pair is a new sample, with output files named as [prefix]__[sample].xxx, where [sample] is taken from the base name of the reads fastq files. Note that if you have lots of multi-run read sets to analyse, the ease of job submission will depend heavily on how your files are named and you will need to figure out your own approach to manage this (ie there is no way to submit multiple sets of multiple reads).

MLST Database format

MLST databases are specified by allele sequences, and a profiles table. These can be downloaded from the public databases, ready to use with SRST2, using the provided script getmlst.py (see above).

Allele sequences file, fasta format.

--mlst_db alleles.fasta

This should contain ALL allele sequences for the MLST scheme; i.e. if there are 7 loci in the scheme, then the sequences of all alleles from all 7 loci should appear together in this file. If you have one file per locus, just cat these together first. If you use getmlst.py, this is done for you.

--mlst_delimiter '-'

The names of the alleles (i.e. the fasta headers) are critical for a functioning MLST scheme and therefore for correct calling of STs. There are two key components to every fasta header: the name of the locus (e.g. in E. coli these are adk, fumC, gyrB, icd, mdh, purA, recA) and the number assigned to the allele (1, 2, 3 etc). These are usually separated by a delimiter like '-' or ''; e.g. in the E. coli scheme, alleles are named adk-1, fumC-10, etc. For ST calling to work properly, SRST2 needs to know what the delimiter is. By default, we assume it is '-' as this is the most common; however some schemes use '' (e.g. in the C. difficile scheme, the first allele is 'adk_1', so you would need to set --mlst_delimiter '_' on the SRST2 command line). If you use getmlst.py, it will remind you of this and try to guess for you what the most likely delimiter is.

MLST definitions file, tab delimited format.

--mlst_definitions

This is the file that tells you the ST number that is assigned to known combinations of alleles. Column 1 is the ST, and subsequent columns are the loci that make up the scheme. The names of these loci must match the allele names in the sequences database, e.g. adk, fumC, gyrB, icd, mdh, purA, recA in the E. coli scheme. If you download data from pubmlst this should not be a problem. Sometimes there are additional columns in this file, e.g. a column assigning STs to clonal complexes. SRST2 will ignore any columns that don't match gene names found in the allele sequences file.

Gene databases

In addition to MLST, SRST2 can do gene/allele detection. This works by mapping reads to each of the reference sequences in a fasta file(s) (provided through --gene_db) and reporting details of all genes that are covered above a certain level (--min_coverage, 90% by default).

If the input database contains different alelles of the same gene, SRST2 can report just the best matching allele for that gene (much like with MLST we report the best matching allele for each locus in the scheme). This makes the output manageable, as you will get one column per gene/locus (e.g. blaCTX-M) which reports the specific allele that was called in each sample (e.g. blaCTX-M-15 in sample A, blaCTX-M-13 in sample B).

We have provided some databases of resistance genes, plasmid genes, and serotyping for E. coli and Group B Streptococcus (see Preformatted databases for specialist typing with SRST2 below for details on using these databases).

You can however format any sequence set for screening with SRST2. See instructions below.

Output files

MLST results

If MLST sequences and profiles were provided, STs will be printed in tab-delim format to a file called [outputprefix]__mlst__[db]__results.txt, e.g.: strainArun1__mlst__Escherichia_coli__results.txt.

The format looks like this:

Sample	ST	adk	fumC	gyrB	icd	mdh	purA	recA	mismatches	uncertainty	depth	maxMAF
strainA	1502	6	63	7	1	14	7	7	0	-	12.3771855735	0.275

Each locus has a column in which the best scoring allele number is printed.

* indicates the best scoring allele has >=1 mismatch (SNP or indel, according to majority rules consensus of the aligned reads vs the allele sequence). Details of the mismatches are given in the mismatches column. This often means you have a novel allele.

? indicates uncertainty in the result because the best scoring allele has some low-depth bases; either the the first or last 2 bases of the allele had <N reads mapped, or a truncation was called in which neigbhbouring bases were coverd with <N reads, or the average depth across the whole allele was <X. N is set by the parameter --min_edge_depth (default 2), X is set by --min_depth (default 5). The source of the uncertainty is printed to the uncertainty column.

- indicates that no allele could be assigned (generally means there were no alleles that were >90% covered by reads)

If the combination of these alleles appears in the ST definitions file provided by --mlst_definitions, this ST will be printed in the ST column. "NF" indicates the allele combination was not found; "ND" indicates ST calculations were not done (because no ST definitions were provided). Here, * next to the ST indicates that there were mismatches against at least one of the alleles. This suggests that you have a novel variant of this ST rather than a precise match to this ST. ? indicates that there was uncertainty in at least one of the alleles. In all cases, the ST is calculated using the best scoring alleles, whether or not there are mismatches or uncertainty in those calls.

The mismatches column gives details of any mismatches (defined by majority rules consensus of the aligned reads vs the allele sequence) against the top scoring allele that is reported in the corresponding locus column. Possibilities are: (i) snps, adk-1/1snp indicates there was 1 SNP against the adk-1 allele; (ii) indels, adk-1/2indel indicates there were 1 indels (insertion or deletion calls in the alignment); (iii) holes, adk-1/5holes indicates there were 5 sections of the allele sequence that were not covered in the alignment and pileup (e.g. due to truncation at the start or end of the gene, or large deletions within the gene).

The uncertainty column gives details of parts of the top scoring alleles for which the depth of coverage was too low to give confidence in the result, this may be zero or any number up to the specified cutoff levels set via --min_edge_depth and --min_depth. Possibilities are considered in this order: (i) edge depth, adk-1/edge1.0 indicates that the mean read depth across either the first 2 or last 2 bases of the assigned allele was 1.0; this is monitored particularly because coverage at the ends of the allele sequences is dependent on bowtie2 to properly map reads that overhang the ends of the allele sequences, which is not as confident as when the whole length of a read maps within the gene (reported if this value is below the cutoff specified (default --min_edge_depth 2), low values can be interpreted as indicating uncertainty in the result as we can’t confidently distinguish alleles that differ at these low-covered bases); (ii) truncations, adk-1/del1.0 indicates that a truncation or large deletion was called for which the neighbouring 2 bases were covered to depth 1.0, this can be interpreted as indicating there is only very weak evidence for the deletion, as it is likely just due to random decline in coverage at this point (reported if this value is below the cutoff specified (default --min_edge_depth 2); (iii) average depth, adk-1/depth3.5 indicates that the mean read depth across the length of the assigned allele was 3.5 (reported if this value is below the cutoff specified, which by default is --min_depth 5)

The depth column indicates the mean read depth across the length of all alleles which were assigned a top scoring allele number (i.e. excluding any which are recorded as '-'). So if there are 7 loci with alleles called, this number represents the mean read depth across those 7 loci. If say, 2 of the 7 alleles were not called (recorded as ‘-’), the mean depth is that of the 5 loci that were called.

The maxMAF column reports the highest minor allele frequency (MAF) of variants encountered across the MLST locus alignments. This value is in the range 0 -> 0.5; with e.g. 0 indicating no variation between reads at any aligned base (i.e. at all positions in the alignment, all aligned reads agree on the same base call; although this agreed base may be different from the reference); and 0.25 indicating there is at least one position in the alignment at which all reads do not agree, and the least common variant (either match or mismatch to the reference) is present in 25% of reads. This value is printed, for all alleles, to the scores file; in the MLST file, the value reported is the highest MAF encountered across the MLST loci.

Gene typing

Gene typing results files report the details of sequences provided in fasta files via --genes_db that are detected above the minimum %coverage theshold set by --min_coverage (default 90).

Two output files are produced:

1 - A detailed report, [outputprefix]__fullgenes__[db]__results.txt, with one row per gene per sample:

Sample	DB	gene	allele	coverage	depth	diffs	uncertainty	cluster	seqid
strainA	resistance	dfrA	dfrA1_1	100.0	6.79368421053		edge1.5	590	137
strainA	resistance	aadA	aadA1-5	100.0	10.6303797468			162	1631
strainA	resistance	sul2	sul2_9	100.0	79.01992966			265	1763
strainA	resistance	blaTEM	blaTEM-1_5	100.0	70.8955916473			258	1396
strainA	resistance	tet(A)	tet(A)_4	97.6666666667	83.5831202046	28holes	edge0.0	76	1208
strainB	resistance	strB	strB1	100.0	90.0883054893			282	1720
strainB	resistance	strA	strA4	100.0	99.0832298137			325	1142

coverage indicates the % of the gene length that was covered (if clustered DB, then this is the highest coverage of any members of the cluster)

uncertainty is as above

2 - A tabulated summary report of samples x genes, [outputprefix]__genes__[db]__results.txt:

Sample	aadA	blaTEM	dfrA	strA	strB	sul2	tet(A)
strainA	aadA1-5	blaTEM-1_5	dfrA1_1?	-	-	sul2_9	tet(A)_4*?
strainB	-	-	-	strA4	strB1	-	-

The first column indicates the sample name, all other columns report the genes/alleles that were detected in the sample set. If multiple samples were input, or if previous outputs were provided for compiling results, then all the genes detected in ANY of the samples will have their own column in this table.

If you were using a clustered gene database (such as the ARGannot_r3.fasta database provided), the name of each cluster (i.e. the basic gene symbol) will be printed in the column headers, while specific alleles will be printed in the sample rows.

* indicates mismatches

? indicates uncertainty due to low depth in some parts of the gene

- indicates the gene was not detected (> %coverage threshold, --min_coverage 90)

Combined results

If more then one database is provided for typing (via --mlst_db and/or --gene_db), or if previous results are provided for merging with the current run which contain data from >1 database (via --prev_output), then an additional table summarizing all the database results is produced. This is named [outputprefix]__compiledResults.txt and is a combination of the MLST style table plus the tabulated gene summary (file 2 above).

Sample	ST	adk	fumC	gyrB	icd	mdh	purA	recA	mismatches	uncertainty	depth	maxMAF	aadA	blaTEM	dfrA	strA	strB	sul2	tet(A)
sampleA	152*	11	63*	7	1	14	7	7	0	-	21.3	0.05	aadA1-5	blaTEM-1_5	dfrA1_1?	strA4	strB1	sul2_9	tet(A)_4*?

Mapping results

The bowtie2 alignment of reads to each input database is stored in [outputprefix]__[sample].[db].sorted.bam and the samtools pileup of the alignment is stored in [outputprefix]__[sample].[db].pileup.

If you used --save_scores, a table of scores for each allele in the database is printed to [outputprefix]__[sample].[db].scores.

Printing consensus sequences

SRST2 can optionally report consensus sequences & pileups for novel alleles, or for all alleles. Note that only SNPs are included in the consensus fasta sequence as these can be reliably extracted from the read alignments; if there are indels (reported in the output tables) these will not be included in the consensus fasta file and we suggest you assembly the mapped reads (these are in the bam file).

By default, no allele sequences are generated, the results are simply tabulated.

Report consensus sequences for novel alleles

--report_new_consensus

For all samples and loci where the top scoring allele contains SNPs:

a pileup file will be generated for the top scoring allele, with the name [allele].[output]__[readset].[database].pileup
the consensus sequence will be printed to a fasta file with the name [output].new_consensus_alleles.fasta
fasta headers will be in the format >[allele].variant [sample]

IN ADDITION TO THE NOVEL ALLELES FILE OUTLINED ABOVE, the following will ALSO occur:

a pileup file will be generated for the top scoring allele for each sample at each locus, with the name [allele].[output]__[readset].[database].pileup
the consensus sequence for the top scoring allele for each sample at each locus will be printed to a fasta file with the name [output].all_consensus_alleles.fasta
fasta headers will be in the format >[allele].variant [sample]

Report consensus sequences for all alleles

--report_all_consensus

Collate consensus sequences output by SRST2 run on multiple strains & loci, into one file per locus

For genes:

python srst2/scripts/consensus_alignment.py --in *.all_consensus_alleles.fasta --pre test --type gene

For mlst:

python srst2/scripts/consensus_alignment.py --in *.all_consensus_alleles.fasta --pre test --type mlst --mlst_delimiter _

More basic usage examples

Run single read sets against MLST database and resistance database

srst2 --input_pe pool11_tag2_1.fastq.gz pool11_tag2_2.fastq.gz 
	--output pool11_tag2_Shigella --log 
	--gene_db CARD_v3.0.8_SRST2.fasta 
	--mlst_db Escherichia_coli.fasta 
	--mlst_definitions ecoli.txt

Run against multiple read sets in serial

srst2 --input_pe *.fastq.gz
	--output Shigella --log
	--gene_db CARD_v3.0.8_SRST2.fasta 
	--mlst_db Escherichia_coli.fasta 
	--mlst_definitions ecoli.txt

Run against new read sets, merge with previous reports (individual or compiled)

	srst2 --input_pe strainsY-Z*.fastq.gz
		--output strainsA-Z --log
		--gene_db CARD_v3.0.8_SRST2.fasta 
		--mlst_db Escherichia_coli.fasta 
		--mlst_definitions ecoli.txt
		--prev_output ShigellaA__genes__resistance__results.txt
        	 ShigellaA__mlst__Escherichia_coli__results.txt
        	 ShigellaB__genes__resistance__results.txt
        	 ShigellaB__mlst__Escherichia_coli__results.txt
        	 ShigellaC-X__compiledResults.txt

Run against Enterococcus reads, where read names are different from the usual _1.fastq and _2.fastq

srst2 --input_pe strain_R1.fastq.gz strain_R2.fastq.gz 
	--forward _R1 --reverse _R2 
	--output strainA --log 
	--gene_db CARD_v3.0.8_SRST2.fasta 
	--mlst_db Enterococcus_faecium.fasta 
	--mlst_definitions efaecium.txt

Compile results from completed runs

srst2 --prev_output *compiledResults.txt --output Shigella_report

Running lots of jobs and compiling results

Run against multiple read sets: submitting 1 job per readset to SLURM queueing system.

slurm_srst2.py --script srst2 
	--output test 
	--input_pe *.fastq.gz 
	--other_args '--gene_db CARD_v3.0.8_SRST2.fasta 
	--mlst_db Escherichia_coli.fasta 
	--mlst_definitions ecoli.txt 
	--save_scores' 
	--walltime 0-1:0 
		> job_sub_list.txt

The results from all the separate runs can then be compiled together using:

srst2 --prev_output *compiledResults.txt --output Shigella_report

SLURM job script usage options

slurm_srst2.py -h

optional arguments:
  -h, --help            show this help message and exit
  
  --walltime WALLTIME   wall time (default 0-0:30 = 30 minutes)
  
  --memory MEMORY       mem (default 4096 = 4gb)
  
  --rundir RUNDIR       directory to run in (default current dir)
  
  --script SCRIPT       SRST2 script (/path/to/srst2_150
                        9_reporting2.py)
                        
  --output OUTPUT       identifier for outputs (will be combined with read set
                        identifiers)
                        
  --input_se INPUT_SE [INPUT_SE ...]
                        Single end read file(s) for analysing (may be gzipped)
                        
  --input_pe INPUT_PE [INPUT_PE ...]
                        Paired end read files for analysing (may be gzipped)
                        
  --forward FORWARD     Designator for forward reads (only used if NOT in
                        MiSeq format sample_S1_L001_R1_001.fastq.gz; otherwise
                        default is _1, i.e. expect forward reads as
                        sample_1.fastq.gz)
                        
  --reverse REVERSE     Designator for reverse reads (only used if NOT in
                        MiSeq format sample_S1_L001_R2_001.fastq.gz; otherwise
                        default is _2, i.e. expect forward reads as
                        sample_2.fastq.gz)
                        
  --other_args OTHER_ARGS
                        string containing all other arguments to pass to srst2

Known issues

Reference indexing - SRST2 uses bowtie2 for mapping reads to reference sequences. To do this, SRST2 must first check the index exists, call bowtie2-build to generate the index if it doesn't already exist, and then call bowtie2 to map the reads to this indexed reference. Occasionallly bowtie2 will return an Error message saying that it doesn't like the index. This seems to be due to the fact that if you submit multiple SRST2 jobs to a cluster at the same time, they will all test for the presence of the index and, if index files are present, will proceed with mapping... but this doesn't mean the indexing process is actually finished, and so errors will arise.

The simple way out of this is, if you are running lots of SRST2 jobs, FIRST index your reference(s) for bowtie2 and samtools (using bowtie2-build ref.fasta ref.fasta and samtools faidx ref.fasta), then submit your SRST2 jobs. The slurm_srst2.py script takes care of this for you by formatting the databases before submitting any SRST2 jobs.

Generating SRST2-compatible clustered database from raw sequences

Gene database format

To do this properly, SRST2 needs to know which of the reference sequences are alleles of the same gene. This is done by adhering to the following format in the naming of the sequences (i.e. the headers in the fasta sequence for the database):

>[clusterUniqueIdentifier]__[clusterSymbol]__[alleleSymbol]__[alleleUniqueIdentifier]

e.g. in the resistance gene database provided, the first entry is:

>344__blaOXA__blaOXA-181__1

Note these are separated by two underscores. The individual components are:

clusterUniqueIdentifier = 344; unique identifier for this cluster (uniquely identifes the cluster) clusterSymbol = blaOXA; gene symbol for this cluster (may be shared by multiple clusters) alleleSymbol = blaOXA-181; full name of this allele alleleUniqueIdentifier = 1; uniquely identifies the sequence

Ideally the alleleSymbol would be unique (as it is in the reference.fasta file provided). However it doesn't have to be: if allele symbols are not unique, then SRST2 will use the combination [alleleSymbol]__[alleleUniqueIdentifier] to uniquely identify the sequence in the resulting reports, so that you can trace exactly which sequence was present in each sample.

Additional gene annotation can appear on the header line, after a space. This additional info will be printed in the full genes report, but not in the compiled results files.

e.g. for the blaOXA sequence above, the full header is actually:

>344__blaOXA__blaOXA-181__1 blaOXA-181_1_HM992946; HM992946; betalactamase

Sourcing suitable gene databases

To get started, we have provided databases for resistance genes, plasmid genes, and serotyping for E. coli and Group B Streptococcus (see Preformatted databases for specialist typing with SRST2; and code (database_clustering/) to extract virulence factors for a genus of interest from the Virulence Factor DB (detailed instructions below).

If you want to use your own database of allele sequences, with the reporting behaviour described, you will need to assign your sequences to clusters and use this header format. To facilitate this, use the scripts provided in the database_clustering directory provided with SRST2, and follow the instructions below.

You can also use unclustered sequences. This is perfectly fine for gene detection applications, where you have one representative allele sequence for each gene, and you simply want to know which samples contain a sequence that is similar to this one (e.g. detecting plasmid replicons, where there is one target sequence per replicon). However, this won't work well for allele typing. If the sequence database contains multiple allele sequences for the same gene, then all of these that are covered above the length threshold (default 90%) will be reported in the output, which makes for messy reporting. If you do this, you would probably find it most useful to look at the full gene results table rather than looking at the compiled results output.

If you already know which alleles belong to the same gene family and should be clustered for reporting:

Use this info to generate the appropriate headers for SRST2 to read. The provided script csv_to_gene_db.py can help:

csv_to_gene_db.py -h

Usage: csv_to_gene_db.py [options]

Options:

  -h, --help            show this help message and exit
  
  -t TABLE_FILE, --table=TABLE_FILE
                        table to read (csv)
                        
  -o OUTPUT_FILE, --out=OUTPUT_FILE
                        output file (fasta)
                        
  -s SEQ_COL, --seq_col=SEQ_COL
                        column number containing sequences
                        
  -f FASTA_FILE, --fasta=FASTA_FILE
                        fasta file to read sequences from (must specify which
                        column in the table contains the sequence names that
                        match the fasta file headers)
                        
  -c HEADERS_COL, --headers_col=HEADERS_COL
                        column number that contains the sequence names that
                        match the fasta file headers

The input table should be comma-separated (csv, unix newline characters) and have these columns:

seqID,clusterid,gene,allele,(DNAseq),other....

which will be used to make headers of the required form [clusterID]__[gene]__[allele]__[seqID] [other stuff]

If you have the sequences as a column in the table, specify which column they are in using -s:

csv_to_gene_db.py -t genes.csv -o genes.fasta -s 5

Alternatively, if you have sequences in a separate fasta file, you can provide this file via -f. You will also need to have a column in the table that links the rows to unique sequences, specify which column this is using -c:

csv_to_gene_db.py -t genes.csv -o genes.fasta -f rawseqs.fasta -c 5

Clustering sequences

If your sequences are not already assigned to gene clusters, you can do this automatically using CD-HIT (http://weizhong-lab.ucsd.edu/cd-hit/).

1 - Run CD-HIT to cluster the sequences at 90% nucleotide identity:

cdhit-est -i rawseqs.fasta -o rawseqs_cdhit90 -d 0 > rawseqs_cdhit90.stdout

2 - Parse the cluster output and tabulate the results, check for inconsistencies between gene names and the sequence clusters, and generate individual fasta files for each cluster to facilitate further checking:

python cdhit_to_csv.py --cluster_file rawseqs_cdhit90.clstr --infasta raw_sequences.fasta --outfile rawseqs_clustered.csv

For comparing gene names to cluster assignments, this script assumes very basic nomenclature of the form gene-allele, ie a gene symbol followed by '-' followed by some more specific allele designation. E.g. adk-1, blaCTX-M-15. The full name of the gene (adk-1, blaCTX-M-15) will be stored as the allele, and the bit before the '-' will be stored as the name of the gene cluster (adk, blaCTX). This won't always give you exactly what you want, because there really are no standards for gene nomenclature! But it will work for many cases, and you can always modify the script if you need to parse names in a different way. Note though that this only affects how sensible the gene cluster nomenclature is going to be in your srst2 results, and will not affect the behaviour of the clustering (which is purely sequence based using cdhit) or srst2 (which will assign a top scoring allele per cluster, the cluster name may not be perfect but the full allele name will always be reported anyway).

3 - Convert the resulting csv table to a sequence database using:

csv_to_gene_db.py -t rawseqs_clustered.csv -o seqs_clustered.fasta -f rawseqs.fasta -c 4

The output file, seqs_clustered.fasta, should now be ready to use with srst2 (--gene_db seqs_clustered.fasta).

If there are potential inconsistencies detected at step 2 above (e.g. multiple clusters for the same gene, or different gene names within the same cluster), you may like to investigate further and change some of the cluster assignments or cluster names. You may find it useful to generate neighbour joining trees for each cluster that contains >2 genes, using align_plot_tree_min3.py

Screening for resistance genes with SRST2

A database of resistance genes CARD v3.0.8 is included with SRST2: data/CARD_v3.0.8_SRST2.fasta. The fasta file is ready for use with SRST2. The CSV table /data/CARD_v3.0.8_clustered.csv contains the same sequence information, but in tabular format for easier parsing/editing.

An easy way to add sequences to this database would be to add new rows to the table, and then generate an updated fasta file using:

csv_to_gene_db.py -t rawseqs_clustered.csv -o seqs_clustered.fasta -s rawseqs.fasta -c 5

Using the VFDB Virulence Factor Database with SRST2

The VFDB houses sets of virulence genes for a range of bacterial genera, see http://www.mgc.ac.cn/VFs/.

To type these virulence genes using SRST2, download the full set of sequences from the VFDB website (http://www.mgc.ac.cn/VFs/Down/VFDB_setB_nt.fas.gz) and follow these steps to generate SRST2-compatible files for your genus of interest.

1 - Extract virulence genes by genus from the main VFDB file, CP_VFs.ffn:

python VFDBgenus.py --infile CP_VFs.ffn --genus Clostridium

or, to get all availabel genera in separate files:

python VFDBgenus.py --infile CP_VFs.ffn

2 - Run CD-HIT to cluster the sequences for this genus, at 90% nucleotide identity:

cd-hit -i Clostridium.fsa -o Clostridium_cdhit90 -c 0.9 > Clostridium_cdhit90.stdout

3 - Parse the cluster output and tabulate the results using the specific Virulence gene DB compatible script:

python VFDB_cdhit_to_csv.py --cluster_file Clostridium_cdhit90.clstr --infile Clostridium.fsa --outfile Clostridium_cdhit90.csv

4 - Convert the resulting csv table to a SRST2-compatible sequence database using:

python csv_to_gene_db.py -t Clostridium_cdhit90.csv -o Clostridium_VF_clustered.fasta -s 5

The output file, Clostridium_VF_clustered.fasta, should now be ready to use with srst2 (--gene_db Clostridium_VF_clustered.fasta).

Preformatted databases for specialist typing with SRST2

Using the EcOH database for serotyping E. coli with SRST2

Details can be found in this MGen paper.

The EcOH database includes genes for identifying O and H types in E. coli, see /data/EcOH.fasta

O types are represented by the presence of two loci (either wzx and wzy, or wzm and wzt). Note that allelic variation is possible but does not impact serotype in a predictable way, so typing calls should be made based on the presence of genes rather than allele assignments (i.e. it is generally safe to ignore *? characters)

H types are represented by alleles of fliC or flnA flagellin genes (one allele per H type). Note that it is possible to carry both a fliC allele and a flnA allele, such strains should be considered phase variable for flagellin).

Basic Usage – Serotyping E. coli

srst2 --input_pe strainA_1.fastq.gz strainA_2.fastq.gz --output strainA_serotypes --log --gene_db EcOH.fasta

Example

Note these reads sets are each ~100 MB each, so you need to download 400 MB of test data to run this example

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR178/ERR178148/ERR178148_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR178/ERR178148/ERR178148_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR178/ERR178156/ERR178156_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR178/ERR178156/ERR178156_2.fastq.gz

srst2 --input_pe ERR178148*.fastq.gz ERR178156*.fastq.gz --output serotypes --log --gene_db EcOH.fasta

Results will be output in: [prefix]__genes__EcOH__results.txt

Output from the above example would appear in: serotypes__genes__EcOH__results.txt

Sample	fliC	wzm	wzt	wzx	wzy
ERR178148	fliC-H31_32*	-	-	wzx-O131_165*	wzy-O131_386
ERR178156	fliC-H33_34	wzm-O9_93*	wzt-O9_107*	-	-

Interpretation

ERR178148 has matching wzx and wzy hits for O131, and fliC allele H31, thus the predicted serotype is O131:H31.

ERR178156 has matching wzm and wzt hits for O9 and fliC allele H33, thus the predicted serotype is O9:H33.

Note that each O antigen type is associated with loci containing EITHER wzx and wzy, OR wzm and wzt genes.

No alleles for flnA were detected in these strains, indicating they are not phase variable for flagellin.

Using the GBS-SBG database for serotyping Streptococcus agalactiae (Group B Streptococcus) with SRST2

Details can be found in this preprint. The database and a script for typing using assemblies can be found in the GBS-SBG GitHub repository. The fasta database itself is already properly formatted for SRST2 and this is all that's needed for SRST2 to type using short reads.

Basic Usage - Serotyping GBS

srst2 --input_pe strainA_1.fastq.gz strainA_2.fastq.gz --output strainA_test --log --gene_db GBS-SBG.fasta

Example

Note these read sets are ~150 MB each, so you need to download ~600 MB of data to run this example.

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR632/000/SRR6327950/SRR6327950_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR632/000/SRR6327950/SRR6327950_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR632/007/SRR6327887/SRR6327887_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR632/007/SRR6327887/SRR6327887_2.fastq.gz

srst2 --input_pe SRR6327950*.fastq.gz SRR6327887*.fastq.gz --output GBS-serotypes --log --gene_db GBS-SBG.fasta

The output from the above will be in GBS-serotypes__genes__GBS-SBG__results.txt.

Sample	GBS-SBG
SRR6327887	GBS-SBG:Ia*
SRR6327950	GBS-SBG:III-4

Interpretation

SRR6327887 is serotype Ia; it has some variants relative to the locus, which is why it has a *. Looking at GBS-serotypes__fullgenes__GBS-SBG__results.txt, we see that it has 24 SNPs and 1 indel (for a 0.358% divergence - i.e. 99.6% identical to the reference sequence (~6.7 kb)). As only one representative of each serotype is in the reference database, some divergence is not unexpected, and this appears to be a reliable call. SRR6327950 is serotype III, subtype 4.

Typing the LEE pathogenicity island of E. coli

Details can be found in this Nature Microbiology paper.

The LEE typing database is based on analysis of >250 LEE-containing E. coli genomes and includes 7 loci (eae (i.e. intimin), tir, espA, espB, espD, espH, espZ). The data is provided as a MLST-style database, in which combinations of alleles are assigned to a LEE subtype, to facilitate a common nomenclature for LEE subtypes. However please note the database does not contain every known allele and is not intended to function as a typical MLST scheme. Rather, each sequence in the database represents a cluster of closely related alleles that have been assigned to the same locus type. So you will generally not get precise matches to a known allele, and this is not something to worry about - the goal is to identify the locus type and the overall LEE subtype.

Also note that the LEE scheme includes three distinct lineages: Lineage 1 consists of LEE subtypes 1-2; Lineage 2 consists of LEE subtypes 3-8; Lineage 3 consists of LEE subtypes 9-30. See the paper for more details.

The database files (sequences + MLST-style profile definitions) are in the SRST2 distribution under /data.

Basic Usage – Typing E. coli LEE variants

srst2 --input_pe strainA_1.fastq.gz strainA_2.fastq.gz --output strainA_LEE --log --mlst_db LEE_mlst.fasta --mlst_definitions LEE_STscheme.txt

Example

Note these reads sets are each ~100 MB each, so you need to download 400 MB of test data to run this example

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR178/ERR178148/ERR178148_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR178/ERR178148/ERR178148_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR178/ERR178156/ERR178156_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR178/ERR178156/ERR178156_2.fastq.gz

srst2 --input_pe ERR178156*.fastq.gz ERR124656*.fastq.gz --output LEE --log --mlst_db LEE_mlst.fasta --mlst_definitions LEE_STscheme.txt

Results will be output in: [prefix]__mlst__LEE_mlst__results.txt

Output from the above example would appear in: LEE__mlst__LEE_mlst__results.txt

Sample	ST	eae	tir	espA	espB	espD	espH	espZ	mismatches	uncertainty	depth	maxMAF
ERR178148	30*	19	10*	9*	5*	6*	4*	13*	tir-10/33snp3indel;espA-9/3snp;espB-5/4snp;espD-6/9snp;espH-4/1snp;espZ-13/3snp	-	59.8217142857	0.285714285714
ERR178156	9*	6*	5*	5*	3*	3*	4*	8	eae-6/4snp;tir-5/18snp;espA-5/6snp;espB-3/4snp;espD-3/8snp;espH-4/3snp	-	33.5062857143	0.444444444444

Interpretation

ERR178148 and ERR178156 carry LEE subtypes 30 and 9 respectively, which both belong to LEE lineage 3. The imprecise matches are to be expected and can be taken as a confident assignment of LEE type.

Plotting output in R

Some R functions are provided in scripts/plotSRST2data.R for plotting SRST2 output to produce images like those in the paper (e.g. Figure 8: http://www.genomemedicine.com/content/6/11/90/figure/F8)

These functions require the ape package to be installed.

Example usage:

# load the functions in R
source("srst2/scripts/plotSRST2data.R")

1. EXAMPLE FROM FIGURE 8A (http://www.genomemedicine.com/content/6/11/90/figure/F8): viewing resistance genes in individual strains that have been analysed for MLST + resistance genes

# read in a compiled MLST + genes table output by SRST2
Ef_JAMA<-read.delim("srst2/data/EfaeciumJAMA__compiledResults.txt", stringsAsFactors = F)

# Check column names. Sample names are in column 1 (strain_names=1 in the function call below), MLST data is in columns 2 to 9 (mlst_columns = 2:9), while gene presence/absence information is recorded in columns 13 to 31 (gene_columns = 13:31). 

colnames(Ef_JAMA)
 [1] "Sample"         "ST"             "AtpA"           "Ddl"            "Gdh"            "PurK"          
 [7] "Gyd"            "PstS"           "Adk"            "mismatches"     "uncertainty"    "depth"         
[13] "Aac6.Aph2_AGly" "Aac6.Ii_AGly"   "Ant6.Ia_AGly"   "Aph3.III_AGly"  "Dfr_Tmt"        "ErmB_MLS"      
[19] "ErmC_MLS"       "MsrC_MLS"       "Sat4A_AGly"     "TetL_Tet"       "TetM_Tet"       "TetU_Tet"      
[25] "VanA_Gly"       "VanH.Pt_Gly"    "VanR.A_Gly"     "VanS.A_Gly"     "VanX.M_Gly"     "VanY.A_Gly"    
[31] "VanZ.A_Gly"    

# Make a tree based on the MLST loci, and plot gene content as a matrix. 
# cluster=T turns on hierarchical clustering of the columns (=genes); labelHeight, infoWidth and treeWidth control the relative dimensions of the plotting areas available for the tree, printed MLST information, and gene labels.

geneContentPlot(m=Ef_JAMA, mlst_columns = 2:9, gene_columns = 13:31, strain_names=1, cluster=T, labelHeight=40, infoWidth=15, treeWidth=5)

2. EXAMPLE FROM FIGURE 7C (http://www.genomemedicine.com/content/6/11/90/figure/F7): viewing summaries of resistance genes by ST, in a set of isolates that have been analysed for MLST + resistance genes

# read in a compiled MLST + genes table output by SRST2
Ef_Howden<-read.delim("srst2/data/EfaeciumHowden__compiledResults.txt", stringsAsFactors = F)

# Make a tree of STs based on the MLST alleles and plot this; Group isolates by ST and calculate the number of strains in each ST that contain each resistance gene;  plot these counts as a heatmap. Note that barplots of the number of strains in each ST are also plotted to the right.
geneSTplot(d,mlst_columns=8:15,gene_columns=17:59,plot_type="count",cluster=T)

# Same as above but plot the rate of detection of each gene within each ST, not the raw counts
geneSTplot(d,mlst_columns=8:15,gene_columns=17:59,plot_type="rate",cluster=T)

# To suppress SNPs, i.e. collapse ST1 and ST1* into a single group for summarisation at the clonal complex level, set suppressSNPs=T.

# To suppress uncertainty due to low depth, i.e. collapse ST1 and ST1? into a single group for summarisation at the clonal complex level, set suppressUncertainty=T.

Note, heatmap colours can be set via the matrix.colours parameter in both of these functions. The default value is matrix.colours=colorRampPalette(c("white","yellow","blue"),space="rgb")(100), i.e. white=0% gene frequency, yellow = 50% and blue = 100%.

srst2's People

Contributors

Stargazers

Watchers

srst2's Issues

Difference in log files for negatives when using gene_db

Hi Kat, this is in reply to your email, but I have included the original query below also.

The files that are created are:
3__3-paired.gene_db.sam
3__3-paired.gene_db.sam.mod
3__3-paired.gene_db.unsorted.bam
3__genes__gene_db__results.txt
However the top 3 are empty and the genes results table just has the isolate number in it?

I've run 10 different genes using 10 different gene_db's now and have returned this error in the log file on a number of negatives - but a lot of them are also ones I'd expect to be negative. Sometimes when an isolate has a positive result at 90% and negative at 60% or vice versa and this log file error has been present it's indicated a file that needs to be re-run (I'm doing 1500 and the internet hamster in Sydney sometimes drops my VPN connection!) But a lot of them when I've rerun them have returned the same log file error and seem to be fine in all other aspects?

Many thanks in advance

Karen

Original query -
I am currently using SRST2 to map reads to target sequences using gene_db. I keep getting a difference in log files for my negatives? Some seem to run fine and return the usual but some of my log files are returning a message after the samtools portion which I have highlighted below.

I have run a different gene_db on this same isolate and not had this message turn up in the log file and similarly I have used the same gene_db.fasta file on different isolates and they seem to have worked fine?

Is this actually a run error and false negative or should I just assume this is a negative for the target sequence? Am I interpreting the log files incorrectly? The below message (2) occurs on the screen when the log files change also? I have tried forcing 0.1.18 version of samtools but this seems to have no effect?

Log file error
08/27/2015 04:47:29 program started
08/27/2015 04:47:29 command line: /local/software/python/2.7.5/bin/srst2 --input_pe /scratch/kc1e12/PhD/WAIT_trimmed/3-paired_1.fastq.gz /scratch/kc1e12/PhD/WAIT_trimmed/3-paired_2.fastq.gz --log$
08/27/2015 04:47:29 Total paired readsets found:1
08/27/2015 04:47:29 Index for gene_db.fasta is already built...
08/27/2015 04:47:29 Processing database gene_db.fasta
08/27/2015 04:47:29 Processing sample 3-paired
08/27/2015 04:47:29 Output prefix set to: 3__3-paired.gene_db
08/27/2015 04:47:29 Aligning reads to index gene_db.fasta using bowtie2...
08/27/2015 04:47:29 Running: bowtie2 -1 /scratch/kc1e12/PhD/WAIT_trimmed/3-paired_1.fastq.gz -2 /scratch/kc1e12/PhD/WAIT_trimmed/3-paired_2.fastq.gz -S 3__3-paired.gene_db.sam -q --very-sensi$
08/27/2015 04:48:28 Processing Bowtie2 output with SAMtools...
08/27/2015 04:48:28 Generate and sort BAM file...
08/27/2015 04:48:28 Running: samtools view -b -o 3__3-paired.gene_db.unsorted.bam -q 1 -S 3__3-paired.gene_db.sam.mod
08/27/2015 04:48:28 {'message': "Command 'samtools view -b -o 3__3-paired.gene_db.unsorted.bam -q 1 -S 3__3-paired.gene_db.sam.mod' failed with non-zero exit status: 1"}
08/27/2015 04:48:28 failed gene detection
08/27/2015 04:48:28 Tabulating results for database gene_db.fasta ...
08/27/2015 04:48:28 Finished processing for database gene_db.fasta ...
08/27/2015 04:48:28 Gene detection output printed to 3__genes__gene_db__results.txt
08/27/2015 04:48:28 SRST2 has finished.

Screen error
' is recognized as '*'.
[main_samview] truncated file.

Rcode

Hello Kat,

I love SRST2, but I would like to plot out the results the readme file says "Added R code for plotting SRST2 output in R (plotSRST2data.R). Instructions will be added to the read me." When!! Is it ready? I was hoping to present it at lab meeting this Thursday. Is there any chance?!?

Please and thanks,
Paola

Extract new allele sequence

Thanks a lot for this perfect tool! But I can't figure out is it possible to extract new allele sequence after it's identification by mapping?

Gene detection reporting - top gene per gene symbol, rather than the top gene per cluster as intended

I found bug in the code whereby we were reporting the top gene per gene symbol, rather than the top gene per cluster.

So, in those cases where there are very distinct groups of genes that share a gene symbol, you would only ever get the top scoring allele amongst them. So you can miss genes.

For example I found some cases today where I was expecting blaOXA-23 to be present, and was getting blaOXA-66 reported as the allele for ‘blaOXA’. Actually blaOXA is a common gene symbol used for genes that span as low as 70% identity, so each of these subtypes of blaOXA need to be treated as different genes that could each have alleles present. We are prepared for this because we have several distinct blaOXA clusters annotated in our resistance gene database, and recommend pre-clustering of all user databases before using with SRST2… but the code was not using the clustering IDs properly. So, instead of having blaOXA-23 (cluster 297) and blaOXA-66 (cluster 299) reported as present, I was just seeing blaOXA-66 (blaOXA) in all the outputs.

SRST2 installation problem with multiple Bowtie versions

Hi,

I saw your paper and I'm really interested in what SRST2 can do.
I've just installed srsts2 following the installation instructions on https://github.com/katholt/srst2.

By default, install all programs in /usr/local. I already had samtools v1.2 and bowtie 2.2.5 in this directory before installing srst2.
So, I downloaded and installed bowtie2 2.2.4 and samtools 0.1.18 in the same directory (/usr/local/) before installing srst2.
During srst2 installation i did the test if the programs were installed properly and it was OK.
I established the
export SRST2_SAMTOOLS=/usr/local/samtools-0.1.18/
export SRST2_BOWTIE2=/usr/local/bowti2-2.2.4/bowtie2
export SRST2_BOWTIE2_BUILD=/usr/local/bowti2-2.2.4/bowtie2-build

I then tried to do a test of the tool using the example.txt. When i run the command
sudo /usr/local/bin/srst2 --input_pe ERR024070_1.fastq.gz ERR024070_2.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta

I only get the shigella1.log and when I look at the log this is what I see ($ gedit shigella1.log):

01/15/2016 12:21:15 program started
01/15/2016 12:21:15 command line: /usr/local/bin/srst2 --input_pe ERR024070_1.fastq.gz ERR024070_2.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta
01/15/2016 12:21:15 Total paired readsets found:1
01/15/2016 12:21:15 Failed command: bowtie2 --version
01/15/2016 12:21:15 [Errno 2] No such file or directory
01/15/2016 12:21:15 Could not determine the version of bowtie.
01/15/2016 12:21:15 Do you have bowtie installed in your PATH?

I'm a newbie in bioinformatics (using the command line) so the "bug" must be very simple but I don't see how to solve it...

Perhaps this info is also useful : when I installed bowtie2 2.2.5 I also defined it on my PATH : export BT2_HOME=/usr/local/bowtie2-2.2.5/

Could this create a conflict?

Any help will be welcomed !:-)

Thanks in advance for you answer/advices.

Sincerely,
P

Error with bowtie

Hi...

I have problem when try to run the command srst2

Input files DNA, FASTA:
salmonella.fasta
Error: could not open salmonella.fasta
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build --wrapper basic-0 salmonella.fasta salmonella.fasta
Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.7', 'console_scripts', 'srst2')()
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1604, in main
bowtie_index(args.mlst_db) # index the MLST database
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 154, in bowtie_index
run_command([bowtie_build_exec, fasta, fasta])
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 134, in run_command
raise CommandError({"message": message})
srst2.srst2.CommandError: {'message': "Command 'bowtie2-build salmonella.fasta salmonella.fasta' failed with non-zero exit status: 1"}

as I can resolve this error?....

I have a second question, if I have four files form miseq and nexseq ( 2file_R1 and 2file_R2 ) belonging to the same sample as I can gather the reads for the same result?

Why are my isolate names appended with a '?' in some cases but not all?

When running the command
srst2 --prev_output *_genes__*__results.txt --output xyz
Why are some isolate names appended with a '?' in the output table?

Add example output to doc (as figure since tables don't display properly)

file scores_vs_expected.py seems missing

Hi Kat and others,

I tried to install srst2 today and run into some issues which seems to be related to missing file called scores_vs_expected.py. using the git clone command, followed by pip install.

When I copied that file from srst2-0.1.7.zip version and placed in into my directory I got it installed (i.e. it says it got installed). However, the installation is not very successful as, when asking for the current version, it gives out random things (see below).

Could you please direct me to the right version of that file?
Many thanks in advance,

Pieter Smit

Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.8', 'console_scripts', 'srst2')()
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 351, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 2363, in load_entry_point
return ep.load()
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 2088, in load
entry = import(self.module_name, globals(),globals(), ['name'])
File "/usr/local/lib/python3.4/dist-packages/srst2/srst2.py", line 306
print "Warning! MLST delimiter is " + delimiter + " but these genes may violate the pattern and cause problems:"
^
SyntaxError: Missing parentheses in call to 'print'
Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.8', 'console_scripts', 'srst2')()
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 351, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 2363, in load_entry_point
return ep.load()
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 2088, in load
entry = import(self.module_name, globals(),globals(), ['name'])
File "/usr/local/lib/python3.4/dist-packages/srst2/srst2.py", line 306
print "Warning! MLST delimiter is " + delimiter + " but these genes may violate the pattern and cause problems:"
^
SyntaxError: Missing parentheses in call to 'print'

download error

$ git clone https://github.com/katholt/srst2
Initialized empty Git repository in /n/home01/yuanli/07152014/srst2/.git/
fatal: https://github.com/katholt/srst2/info/refs download error - The requested URL returned error: 403 Forbidden

Instruction for the srst2_plots.R

Hi! I am trying to use the srst2_plots.R script to create similar plot as shown in Figure 7. When I am generating ST vs gene plot using the geneSTplot function, what should I use as an input file? Similarly for the geneSTbarPlot? Thank you!

VFDB_cdhit_to_csv.py throws KeyError for Mycobacterium

I am following the directions from https://github.com/katholt/srst2#using-the-vfbd-virulence-factor-database-with-srst2 to generate a VFDB gene database for Mycobacterium. I have already run these commands:

python VFDBgenus.py --infile CP_VFs.ffn
cd-hit -i Mycobacterium.fsa -o Mycobacterium_cdhit90 -c 0.9 > Mycobacterium_cdhit90.stdout
(FYI, I am using cd-hit-v4.6.1-2012-08-27)

But when I run this command:
python VFDB_cdhit_to_csv.py --cluster_file Mycobacterium_cdhit90.clstr --infile Mycobacterium.fsa --outfile Mycobacterium_cdhit90.csv

I get this error:
Traceback (most recent call last):
File "/install/srst2/database_clustering/VFDB_cdhit_to_csv.py", line 66, in
sys.exit(main())
File "/install/srst2/database_clustering/VFDB_cdhit_to_csv.py", line 59, in main
clusterid = seq2cluster[seqID]
KeyError: 'R027152'

Fails on bowtie 2.2.x series, needs 2.1.0 only?

Seems to require bowtie 2.1.0 specifically? (which is nearly 2 years old now). We have the 2.2.2 version as default, but it fails:

11/28/2014 12:02:04 program started
11/28/2014 12:02:04 command line: /bio/sw/srst2/bin/srst2 --input_pe ../dropbox/E.faecium/ALL/Ef_aus2223/aus2223_S141_L002_R1_001.fastq.gz ../dropbox/E.faecium/ALL/Ef_aus2223/aus2223_S141_L002_R2_001.fastq.gz --output Aus2223_mlst --log --mlst_db Enterococcus_faecium.fasta --mlst_definitions efaecium.txt
11/28/2014 12:02:04 Total paired readsets found:1
11/28/2014 12:02:04 Incorrect version of bowtie installed.
11/28/2014 12:02:04 bowtie version 2.1.0 is required by SRST2.

accommodate slashes in species names

From: Paul Cherng [email protected]
Date: 30 November 2014 08:46:47 GMT+11
To: [email protected]
Subject: SRST2 getmlst.py bug

Hi Dr. Holt,

I am using your SRST2 software and I have come across a bug with the getmlst.py script. It appears that the "Campylobacter concisus/curvus" species from http://pubmlst.org/data/ causes the script to crash because of the slash in the name. Here is the error I get:

getmlst.py --species "Campylobacter concisus/curvus"

Traceback (most recent call last):
File "/usr/local/bin/getmlst.py", line 183, in
main()
File "/usr/local/bin/getmlst.py", line 141, in main
species_all_fasta_file = open(species_all_fasta_filename, 'w')
IOError: [Errno 2] No such file or directory: u'Campylobacter_concisus/curvus.fasta'

I'm not sure what the best solution for this is, perhaps simply replacing slashes with underscores will suffice.

Thanks,
-Paul Cherng

single end and paired end reads for a sample

Hi Kat,

I have single end read (merged from paired end reads) and paired end reads (remaining result of not merged) for a sample and I am using all for one srst2 run. I am not sure if the --merge_paired option is used for this condition? Because when I tried I got this error:

File "/srst2/srst2.py", line 1514, in main
mate2.append(reads[1])
IndexError: list index out of range

Please advise. Many thanks.

Regards,
Cheng

SRST2v0.1.5 crashes when using gene_db with only gene names in the FastA headers

When using a gene_db with FastA headers in the form ">geneid" or ">geneid additional information", srst2 terminates on a crash (see traceback below).

Attempting to read xxx loci from ST database yyy
Read ST database yyy successfully
Traceback (most recent call last):
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1592, in <module>
    main()
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1548, in main
    db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1102, in run_srst2
    db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1164, in process_fasta_db
    unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1275, in map_fileSet_to_db
    unique_gene_symbols, unique_allele_symbols, pileup_file)
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 788, in parse_scores
    gene_name = get_allele_name_from_db(allele,unique_allele_symbols,unique_cluster_symbols,run_type,args)[2] # cluster ID
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 750, in get_allele_name_from_db
    cluster_id = gene_name = allele_name = seqid = allele_parts[0]
IndexError: list index out of range

As a temporary solution, I modified the code following the patch hereafter.

--- ./srst2orig.py 2015-02-10 14:54:08.000000000 +0100
+++ ./srst2.py  2015-02-10 16:23:16.000000000 +0100
@@ -202,7 +202,8 @@
                                                gene_cluster_symbols[gene_cluster] = cluster_symbol
                                else:
                                        # treat as unclustered database, use whole header
-                                       gene_cluster = cluster_symbol = name
+                                       gene_cluster = cluster_symbol = name.split()[0] #debug: name
+                                       gene_cluster_symbols[gene_cluster] = cluster_symbol #debug
                        else:
                                gene_cluster = name.split(delimiter)[0] # accept gene clusters raw for mlst
                                # check if the delimiter makes sense
@@ -738,7 +739,7 @@
        if run_type != "mlst":
                # header format: >[cluster]___[gene]___[allele]___[uniqueID] [info]
                allele_parts = allele.split()
-               allele_detail = allele_parts.pop(0)
+               allele_detail = allele_parts[0] #debug allele_parts.pop(0)
                allele_info = allele_detail.split("__")

                if len(allele_info)>2:

Please note there is a potential bug in the last line “if len(allele_info)>2:” this should be “if len(allele_info)>3:” instead.

Best Regards,
Christophe

Need a maximum divergence option to simplify gene reporting.

I noticed that we were sometimes picking up quite distant homologs of genes that are in the database, e.g. calls with 250 SNPs against a gene of total length 1000bp. So 25% divergent. Unless you are actually looking for NOVEL genes rather than known ones, we don’t really want this! We JUST want the confident hits to known resistance genes, not hits to distant homologs. Note this might be useful for identifying new virulence genes, or generating hypotheses about unexplained resistance phenotypes, but just adds confusion to resistance gene typing.

Typo in srst2.py causing VFDB searches to fail

I ran into this error exclusively when trying to use a gene database derived from the VFDB files as described on the website. There is a typo on line 649 where the variable seqid is mistyped as "seq_id". it causes the following failure:

foo@computer2:SRST2$ srst2 --input_se foostaph.fastq.gz --output SA_test --log --gene_db Staphylococcus_VF_clustered.fasta
65260 reads; of these:
65260 (100.00%) were unpaired; of these:
61910 (94.87%) aligned 0 times
21 (0.03%) aligned exactly 1 time
3329 (5.10%) aligned >1 times
5.13% overall alignment rate
[samopen] SAM header is present: 1178 sequences.
[mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.2', 'console_scripts', 'srst2')()
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 1316, in main
db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 937, in run_srst2
db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 991, in process_fasta_db
unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report)
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 1101, in map_fileSet_to_db
unique_gene_symbols, unique_allele_symbols)
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 679, in parse_scores
gene_name = get_allele_name_from_db(allele,unique_allele_symbols,unique_cluster_symbols,run_type,args)[0]
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 649, in get_allele_name_from_db
allele_name += "_" + seq_id
NameError: global name 'seq_id' is not defined

Does not appear to affect the function of SRST2 with the supplied resistance.fasta database or MLST functions. Correcting the spelling of the variable at line 649 to "seqid" corrects the problem.

Redundancy (CARB-2 and CARB-8 genes) in ARGannot.fasta

Hi, the sequence of CARB-2 and CARB-8 in this database are the same.

[Evidence] In our database,

137__CARB_Bla__CARB-2__559 yes;yes;CARB-2;Bla;HQ157204;747-1613;867
ATGAAGTTTTTATTGGCATTTTCGCTTTTAATACCATCCGTGGTTTTTGCAAGTAGTTCA
AAGTTTCAGCAAGTTGAACAAGACGTTAAGGCAATTGAAGTTTCTCTTTCTGCTCGTATA
GGTGTTTCCGTTCTTGATACTCAAAATGGAGAATATTGGGATTACAATGGCAATCAGCGC
TTCCCGTTAACAAGTACTTTTAAAACAATAGCTTGCGCTAAATTACTATATGATGCTGAG
CAAGGAAAAGTTAATCCCAATAGTACAGTCGAGATTAAGAAAGCAGATCTTGTGACCTAT
TCCCCTGTAATAGAAAAGCAAGTAGGGCAGGCAATCACACTCGATGATGCGTGCTTCGCA
ACTATGACTACAAGTGATAATACTGCGGCAAATATCATCCTAAGTGCTGTAGGTGGCCCC
AAAGGCGTTACTGATTTTTTAAGACAAATTGGGGACAAAGAGACTCGTCTAGACCGTATT
GAGCCTGATTTAAATGAAGGTAAGCTCGGTGATTTGAGGGATACGACAACTCCTAAGGCA
ATAGCCAGTACTTTGAATAAATTTTTATTTGGTTCCGCGCTATCTGAAATGAACCAGAAA
AAATTAGAGTCTTGGATGGTGAACAATCAAGTCACTGGTAATTTACTACGTTCAGTATTG
CCGGCGGGATGGAACATTGCGGATCGCTCAGGTGCTGGCGGATTTGGTGCTCGGAGTATT
ACAGCAGTTGTGTGGAGTGAGCATCAAGCCCCAATTATTGTGAGCATCTATCTAGCTCAA
ACACAGGCTTCAATGGCAGAGCGAAATGATGCGATTGTTAAAATTGGTCATTCAATTTTT
GACGTTTATACATCACAGTCGCGCTGA

137__CARB_Bla__CARB-8__564 yes;yes;CARB-8;Bla;GQ866976;1345-2211;867
ATGAAGTTTTTATTGGCATTTTCGCTTTTAATACCATCCGTGGTTTTTGCAAGTAGTTCA
AAGTTTCAGCAAGTTGAACAAGACGTTAAGGCAATTGAAGTTTCTCTTTCTGCTCGTATA
GGTGTTTCCGTTCTTGATACTCAAAATGGAGAATATTGGGATTACAATGGCAATCAGCGC
TTCCCGTTAACAAGTACTTTTAAAACAATAGCTTGCGCTAAATTACTATATGATGCTGAG
CAAGGAAAAGTTAATCCCAATAGTACAGTCGAGATTAAGAAAGCAGATCTTGTGACCTAT
TCCCCTGTAATAGAAAAGCAAGTAGGGCAGGCAATCACACTCGATGATGCGTGCTTCGCA
ACTATGACTACAAGTGATAATACTGCGGCAAATATCATCCTAAGTGCTGTAGGTGGCCCC
AAAGGCGTTACTGATTTTTTAAGACAAATTGGGGACAAAGAGACTCGTCTAGACCGTATT
GAGCCTGATTTAAATGAAGGTAAGCTCGGTGATTTGAGGGATACGACAACTCCTAAGGCA
ATAGCCAGTACTTTGAATAAATTTTTATTTGGTTCCGCGCTATCTGAAATGAACCAGAAA
AAATTAGAGTCTTGGATGGTGAACAATCAAGTCACTGGTAATTTACTACGTTCAGTATTG
CCGGCGGGATGGAACATTGCGGATCGCTCAGGTGCTGGCGGATTTGGTGCTCGGAGTATT
ACAGCAGTTGTGTGGAGTGAGCATCAAGCCCCAATTATTGTGAGCATCTATCTAGCTCAA
ACACAGGCTTCAATGGCAGAGCGAAATGATGCGATTGTTAAAATTGGTCATTCAATTTTT
GACGTTTATACATCACAGTCGCGCTGA

They are identical as revealed by megaBLAST. This redudancy already exists in the original ARG-ANNOT database as well as in GenBank:

gb|HQ157204.1|:747-1613|CARB-2
ATGAAGTTTTTATTGGCATTTTCGCTTTTAATACCATCCGTGGTTTTTGCAAGTAGTTCAAAGTTTCAGCAAGTTGAACAAGACGTTAAGGCAATTGAAGTTTCTCTTTCTGCTCGTATAGGTGTTTCCGTTCTTGATACTCAAAATGGAGAATATTGGGATTACAATGGCAATCAGCGCTTCCCGTTAACAAGTACTTTTAAAACAATAGCTTGCGCTAAATTACTATATGATGCTGAGCAAGGAAAAGTTAATCCCAATAGTACAGTCGAGATTAAGAAAGCAGATCTTGTGACCTATTCCCCTGTAATAGAAAAGCAAGTAGGGCAGGCAATCACACTCGATGATGCGTGCTTCGCAACTATGACTACAAGTGATAATACTGCGGCAAATATCATCCTAAGTGCTGTAGGTGGCCCCAAAGGCGTTACTGATTTTTTAAGACAAATTGGGGACAAAGAGACTCGTCTAGACCGTATTGAGCCTGATTTAAATGAAGGTAAGCTCGGTGATTTGAGGGATACGACAACTCCTAAGGCAATAGCCAGTACTTTGAATAAATTTTTATTTGGTTCCGCGCTATCTGAAATGAACCAGAAAAAATTAGAGTCTTGGATGGTGAACAATCAAGTCACTGGTAATTTACTACGTTCAGTATTGCCGGCGGGATGGAACATTGCGGATCGCTCAGGTGCTGGCGGATTTGGTGCTCGGAGTATTACAGCAGTTGTGTGGAGTGAGCATCAAGCCCCAATTATTGTGAGCATCTATCTAGCTCAAACACAGGCTTCAATGGCAGAGCGAAATGATGCGATTGTTAAAATTGGTCATTCAATTTTTGACGTTTATACATCACAGTCGCGCTGA

gb|GQ866976.1|:1345-2211|CARB-8
ATGAAGTTTTTATTGGCATTTTCGCTTTTAATACCATCCGTGGTTTTTGCAAGTAGTTCAAAGTTTCAGCAAGTTGAACAAGACGTTAAGGCAATTGAAGTTTCTCTTTCTGCTCGTATAGGTGTTTCCGTTCTTGATACTCAAAATGGAGAATATTGGGATTACAATGGCAATCAGCGCTTCCCGTTAACAAGTACTTTTAAAACAATAGCTTGCGCTAAATTACTATATGATGCTGAGCAAGGAAAAGTTAATCCCAATAGTACAGTCGAGATTAAGAAAGCAGATCTTGTGACCTATTCCCCTGTAATAGAAAAGCAAGTAGGGCAGGCAATCACACTCGATGATGCGTGCTTCGCAACTATGACTACAAGTGATAATACTGCGGCAAATATCATCCTAAGTGCTGTAGGTGGCCCCAAAGGCGTTACTGATTTTTTAAGACAAATTGGGGACAAAGAGACTCGTCTAGACCGTATTGAGCCTGATTTAAATGAAGGTAAGCTCGGTGATTTGAGGGATACGACAACTCCTAAGGCAATAGCCAGTACTTTGAATAAATTTTTATTTGGTTCCGCGCTATCTGAAATGAACCAGAAAAAATTAGAGTCTTGGATGGTGAACAATCAAGTCACTGGTAATTTACTACGTTCAGTATTGCCGGCGGGATGGAACATTGCGGATCGCTCAGGTGCTGGCGGATTTGGTGCTCGGAGTATTACAGCAGTTGTGTGGAGTGAGCATCAAGCCCCAATTATTGTGAGCATCTATCTAGCTCAAACACAGGCTTCAATGGCAGAGCGAAATGATGCGATTGTTAAAATTGGTCATTCAATTTTTGACGTTTATACATCACAGTCGCGCTGA

All of these sequences listed above are the same to each other. Therefore, we can consider to remove CARB-8 from the database ARGannot.fasta.

error in forward/reverse read identification

Sample column blank in outputmlstXYZ__results.txt

Hello SRST2 Team,
I'm running srsts2 on illumina reads.

I'm throwing my reads for Sequencing typing with below command:

sbatch -p 128gb --nodes=1 -J $i --time 12:0:0 -o $i".%j.out" -e $i".%j.stderr" ~/srst2-master/scripts/srst2.py --input_pe $forward_read $reverse_read --forward $forward_prefix --reverse $reverse_prefix --output $temp_dir/$i --log --save_scores --mlst_db Staphylococcus_aureus.fasta --mlst_definitions saureus.txt --mlst_delimiter '-'

$i is the sample/isolate name.
--output $temp_dir/$i
$temp_dir - is the output directory, and the prefix is $i i.e sample name

I get an output like below, where I'm having blank value in column for Sample:

Sample ST arcc aroe glpf gmk_ pta_ tpi_ yqil mismatches uncertainty depth maxMAF

    121     6       5       6       2       7       14      5       0       -       132.005428571   0.0512820512821`

However, when I run on data ERR024070, as in example.txt bundled with package, I get output as expected, with value in Sample column.

I've tried multiple times, but it seems output won't listen to me.

Is this because I'm having reads with different naming scheme?
Or because I'm running script incorrectly?

Kindly advise.

Many thanks.

global name 'seq_id' is not defined error

I get this error when I use SRST2, How can I resolve it .

Error in step: Using the VFDB Virulence Factor Database with SRST2

1: The URL to the VFDB has changed and, using wget, should now be:

wget http://www.mgc.ac.cn/VFs/Down/VFDB_setB_nt.fas.gz

2: We need three extra steps for the database parsing scripts to work.

i) gunzip VFDB_setB_nt.fas.gz
ii) mv VFDB_setB_nt.fas VFDB_setB_nt.fas.ffn
iii) save the following code as convert.py and execute as python convert.py, which gets rid of the "(gi:xxxxx)" bit in the sequence headers and the commas (replacing with '_') because downstream processing splits on the commas:

with open("VFDB_setB_nt.fas.ffn", "r") as input_handle:
    with open("VFDB_setB_nt.fas_corrected.ffn", "w") as output_handle:
        for line in input_handle:
            if '(gi:' in line:
                line=line.replace('(gi:', ' (gi:').replace(',','_')
                line = line.split()
                output_handle.write(line[0]+' '+' '.join(line[2:])+'\n')
            else:
                output_handle.write(line)

Without removing the text "(gi:xxxxx)" from the header, this next step will not work:

python VFDB_cdhit_to_csv.py --cluster_file Clostridium_cdhit90.clstr --infile Clostridium.fsa --outfile Clostridium_cdhit90.csv

Also, there is a redundant comment on line 46 in the script VFDB_cdhit_to_csv.py (should read "VFGxxxxxx"):

schultzm@x:gene_dbs $ grep 'the unique ID R0xxx' ~/VFDB_cdhit_to_csv.py -n
46:         database[ClusterNr].append(seqID) # for virulence gene DB, this is the unique ID R0xxx

allow multiple read sets per sample, merged to give single result

Couple of fixes for

Dear Kat,

Firstly, thank you for providing such an excellent program. I have been using SRST2 on a few of our more unusual bugs. I encountered a couple of problems that I have fixed. As they were minor fixes, but big problems, I thought you would like to know.

1/ When using the --report_all_consensus. I was getting a bug which originated from line 764 where the output file contained a directory path. It was appending the directory path to the output file and generating an error. I fixed it by replacing

outpileup = allele_name + "." + all_pileup_file

with

outpileup = allele_name + "." + os.path.basename(all_pileup_file)

2/ I was getting a problem where the pileup file was empty but there was clearly coverage when viewing the bam file and sufficient reads were mapping. This seems to be a problem with the default options for samtool mpileup. I fixed this by adding the -A (map anomalous reads) option to line 622. While it may not be good practice to always consider unpaired reads, all of my reads were considered unpaired (all though they were a pe file). It maybe worth either making this the default option or providing an option to consider anomalous reads without having to dig into the code.

All the best,
Sion Bayliss

Nb. User error using getmlst.py script

I just wanted to note this user error which I managed to run afoul of in the past week. If one updates the MLST database files with the getmlst.py script in the same directory where an older version of the files is stored, it will overwrite the alleles files (tfa and fasta) and the ST definition text file, however it leaves the bowtie2 and samtools generated index files intact.

If you neglect to remember this and fail to remove these files manually, SRST2 will subsequently run quite happily and generate a slew of rather erroneous results using the bad index files and the new data files. As far as I can tell, no errors are generated, per se, but the resulting MLST calls are mostly NF*? due to terrible mismatches as the indices don't match the new data files.

It might be nice if the getmlst script removed the old indexes and perhaps even generated the new ones after downloading the MLST database files, but I leave that decision up to you.

Best,
S. Wesley Long

Problem with reading pileup using non-recommended versions of samtools & bowtie

From:
Paola Flórez de Sessions, PhD
Genome Institute of Singapore Efficient Rapid Microbial Sequencing (GERMS) platform leader

Hello Kathryn,

I have been attempting to run srst2 on an mrsa dataset. The output files look really great and I would love to have that information for my dataset. However, I seem to have some trouble. While the program appears to run successfully my *results.txt files is empty while the *.pileup file is populated.

$ bsub -M 1024 -W 560 -n 1 /mnt/software/bin/srst2 --input_pe /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R1.fastq /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R2.fastq --forward _R1 --reverse R2 --output WNB012_31PA --gene_db resistance.fas –log

We had to hack the version of samtools and bowtie as our data czars didn’t want to load older versions. Below is the *.pileup file.

$ [desessions@pegasus pileup]$ head WNB012_31PA__HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.pileup
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 1 A 157 ^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^,^,^,^,^,^,^,^,^, BIIBIFI<IIIFIFFIFBBBFB<FFBFIFBFFFBIFIFIFIFIIBIIFIFIB7<IIFFFFFIFIBFFIBFFIII<FIIFIFIIIBIIFIIFFIIIIBIFF7IIFF7FIFIFFBIIFFFIIFIIIFIIIBFFFFFIFIIIIIFIIIFI<FBFFIIFFF
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 2 T 158 ...................................................................................................................................................,,,,,,,,,^.^. BIIBIFIIIFFIFBIFBFFIBFIF<FIFF<FFBFIIFIFIFIIBIIFIIIFBFIIFIIFIIFIFBFBFFFIII<FIFFIIIFIBIIFIIFFIIIIBIFFBIIIFFFIFIFFFIIBFFIIFIIIFFIIBFFFIBIIIIIIIIIIIFIFFBBBIIFFB<B
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 3 G 161 ...................................................................................................................................................,,,,,,,,,..^.^,^, FIBFIFFBIFIFIFFIFFFFIFIFFFIIFFBFFFIIIFIIIFFIFIIFIFIF<FIIIIIFI<FIFFFIFFFIIIFIIFIIIIIFIIFIIFFIIIIBIFF<IIFIFBIIIFFFIIFFBIIFIIIFFIIFFFFF<IBIIIIIIIIIFIFFBFFIIFFFBBB<<
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 4 G 159 ...............................................................................................................................................,,,,,,,,,...,,^.^. FIBFBIIBIFIFIF<IFFFFIFIIFFIIFFFIIFBIIFIIIBIF<IIIIFIFFIIIIIFIFBI<FBFFFIIII<FIIIIFIIFIIBIFFFIIIIFFIFIIIIIFIIIIFFIIFFIIFIIIFIIIFFBBIIFIIIIIIIIIFIFFBFIIIFFFBBB<<BB
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 5 C 168 ................................................................................................................................................,,,,,,,,,...,,..^.^.^.^.^.^.^.^. FIFFFIIIFIFIFFIFF7FIFIFFFFIFFFFBFBIIFIIIFIIBIIFIFIFFIIIIIIIB<IIF<FFIIII7FIIIIIIIBII7IIIFFIIIFIIF7IIIIIFIIIIFFIIBIFIIFIIIIFIIFFF<IBIFIIIIIIIIIFFIFBFIIIFFBFFBB<BBBBBBBBBB
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 6 T 177 ..................................................................................................................................................,,,,,,,,,...,,..........^.^.^.^.^.^.^. FIIFIII<IIIFIIFIFBFFIFIIFFFIFF7FFFBIIFIIIIFIBIIFIFIFFIIIIIIIB<I<IIFFIFIIIIIIBIIIIIFIIFIIIFFIIIFIII<IIFIIFIIIFFFIIFIIIIFIIIIIIIFBFBIBIBIIIIIIIIIFIFFFFIIIFFFFFFBBBBBBBBBBBBBBBBBBB
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 7 A 183 ....................................................................................................................................................,,,,,,,,,...,,.................^.^.^.^, FIIFFIIBIIIFIIBIFFFFFFIIFFFIFFFFIFFIIFIIIFIIFIIIIFIF7FIIIIIIIBFIBFIFFIIIIIBIII<IIIIIFIFFIFIIIIIIFFII<IIFFFFIIIIFFII7IIIIIIIIIIIIFFFFI<IFIIIIIIIIIFIIFBBFIIFFIFFFBBFFBBBBBBBBBBBBBBB????
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 8 A 184 ....................................................................................................................................................,,,,,,,,,...,,....................,^. FIIFIIIFIIIIIIFIFFFFFFIIFFBIIFFIFFFIIFIIIFII7IIIIIIF<FIIIIIFIFFIFFIFFIFIIIBIIIFIIIII<IIFIIIIIIIIFFII<IIIIIFIIIIFFIFFIFIIIIIIIIIIFFIFIFIFIIIIIIIIIFIFFFFFIIFFIFFFF<FFFFFFFFFFBB<BBBBAAAA;
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 9 A 189 ....................................................................................................................................................,,,,,,,,,...,,....................,.^.^.^.^.^, FIIFFIIBIIIIIIIIIFFFFFIFIFFIIFFFFFFIIIIIIIIIBIIIIIIFBIIIIIIIIBFIBBI<BIIIIIBIIIFIIIIIFIIIIIIIIIIIFFIIBIIFIFFIIIFFFIFFIFII<IIIIFIFFBFFIFIBIIIIIFIIIBIFFFBFIIIFFBFFF7FFFBFFFFFFFFFFFFFBBBB;=====
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 10 A 189 ..................................................................................................................................................,,,,,,,,,...,....................,.....,^.^.^, FIIFIIIBIFIIII<IIFIFIFIIIFFIIFBBFFFFIIIIFFIIBIIIIIIFBIIIIFIFIBFIBFIBFIIIIIBIIIBIIIIFFIFIIFIIIIIIFFIIBIFIFIIIIFIFIFIFIIBIIIIFIIB>>>>>BBE

Then when I check the results it looks like this:
$ [desessions@apollo1 Dec_1st_results]$ cat WNB012_31PA__genes__resistance__results.txt
Sample
HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001

The default minimum coverage remains 90% so in theory I should have results, right? Do you have any idea what could be happening?

Ps I have attached the resistance.fas in case you want to have a look.

Best,

Paola Flórez de Sessions, PhD
Genome Institute of Singapore Efficient Rapid Microbial Sequencing (GERMS) platform leader

UnboundLocalError for VFDB_cdhit_to_csv.py

Good day,

Upon running the VFDB_cdhit_to_csv.py against my cluster file, the following error ensued:

"Traceback (most recent call last):
File "../database_clustering/VFDB_cdhit_to_csv.py", line 67, in
sys.exit(main())
File "../database_clustering/VFDB_cdhit_to_csv.py", line 61, in main
outstring = ",".join([seqID, clusterid, gene, allele, str(record.seq), re.sub(",","",record.description)]) + "\n"
UnboundLocalError: local variable 'clusterid' referenced before assignment"

What could have caused the error?

Thank you very much,
Sarah

fullgenes table not reporting all samples when multiple readsets run in serial

if multiple readsets analysed in serial in a srst2 run, fullgenes reports the results for the last readset only as it was being rewritten each time

Discrepancy between top scoring allele in result file and consensus sequence file

There is a discrepancy between the top scoring allele indicated in the result file versus the consensus sequence file (using --report_all_consensus) when the "mlst_allele/trun" warning is called in the mismatches field.

I believe this is caused by the arguments given to the create_allele_pileup function (specifically top_allele) which does not take into account next_best_allele that is used for reporting when a truncation override is detected.

My apologies if this post is bad form, as I am absolutely brand new to git!

EcOH.fasta has 47 duplicate sequences & some inconsistences

I ran this command to check the database:

cd-hit-est -d 0 -i EcOH.fasta -c 1 -g 1 -o cdhit

and discovered 47 duplicate sequences (and some subsequences), including some which have confounding O/H types - here are some examples:

0       1251nt, >9__wzy__wzy-O123-Gp5__370... *
1       1251nt, >9__wzy__wzy-O123-Gp5__371... at +/100.00%
2       1251nt, >9__wzy__wzy-O123-Gp5__372... at +/100.00%
3       1251nt, >9__wzy__wzy-O186-Gp5__454... at +/100.00%

0       1230nt, >8__wzx__wzx-O153-Gp11__191... *
1       1230nt, >8__wzx__wzx-O178-Gp11__223... at +/100.00%

0       1071nt, >8__wzx__wzx-O28ac-Gp2__250... at +/100.00%
1       1071nt, >8__wzx__wzx-O42-Gp2__266... at +/100.00%
2       1230nt, >8__wzx__wzx-O42-Gp2__267... *

0       1221nt, >8__wzx__wzx-O118-Gp3__140... *
1       1221nt, >8__wzx__wzx-O118-Gp3__141... at +/100.00%
2       1221nt, >8__wzx__wzx-O151-Gp3__189... at +/100.00%

0       1149nt, >9__wzy__wzy-O129-Gp10__383... *
1       1149nt, >9__wzy__wzy-O13-Gp10__384... at +/100.00%
2       1149nt, >9__wzy__wzy-O135-Gp10__390... at +/100.00%
3       1149nt, >9__wzy__wzy-O135-Gp10__391... at +/100.00%

0       1053nt, >9__wzy__wzy-O111__351... *
1       1053nt, >9__wzy__wzy-O7__524... at +/100.00%

0       1074nt, >9__wzy__wzy-O153-Gp11__412... at +/100.00%
1       1098nt, >9__wzy__wzy-O178-Gp11__444... *

0       1380nt, >9__wzy__wzy-O28ac-Gp2__471... *
1       1380nt, >9__wzy__wzy-O42-Gp2__487... at +/100.00%
2       1380nt, >9__wzy__wzy-O42-Gp2__488... at +/100.00%

0       1332nt, >9__wzy__wzy-O17-Gp9__434... *
1       1332nt, >9__wzy__wzy-O44-Gp9__490... at +/100.00%
2       1332nt, >9__wzy__wzy-O77-Gp9__531... at +/100.00%

0       1191nt, >9__wzy__wzy-O18-Gp12__446... *
1       1146nt, >9__wzy__wzy-O18ac__457... at +/100.00%

srst2.py throws error if allele has slash character

I ran srst2 against a gene database that happened to have some slash characters for some of the read names:

70__aec27/clpV__aec27/clpV_EC042_0215__R033079 R033079 aec27/clpV (EC042_0215) - putative type VI secretion system protein [Escherichia coli str. 042 (EAEC O44:H18)]

ATGATCCAGATTGATTTAGCCACGCTGGTAAAGCGGCTTAACCCCTTTGCAAAACAGGCG
...

This causes srst2.py to crash when generating pileups:

Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.5', 'console_scripts', 'srst2')()
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1548, in main
db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1102, in run_srst2
db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1164, in process_fasta_db
unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1275, in map_fileSet_to_db
unique_gene_symbols, unique_allele_symbols, pileup_file)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 859, in parse_scores
allele_pileup_file = create_allele_pileup(top_allele, pileup_file) # XXX Creates a new pileup file for that allele. Not currently cleaned up
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 765, in create_allele_pileup
with open(outpileup, 'w') as allele_pileup:
IOError: [Errno 2] No such file or directory: '67__aec27/clpV__aec27/clpV_G2583_0230__R033067.ERR024627__ERR024627.Escherichia_VF_clustered.pileup'

Possible solution is to just change the slashes to underscores

PIP doesn't have scipy listed as dependency

I did a "pip install srst2/" from a git clone, using virtualenv (blank python canvas):

File "/bio/sw/srst2/local/lib/python2.7/site-packages/srst2/srst2.py", line 19, in
from scipy.stats import binom_test, linregress
ImportError: No module named scipy.stats

I then try to install scipy via PIP, but fails due to lack of numpy.

Once I install numpy, then scipy works, and srst2 works (well, i get the help screen).

I think the PIP file needs to have its dependencies listed in it somehow?

getmlst.py crashes with a cryptic error message when downloading "Campylobacter fetus"

When I run this command:

getmlst.py --species "Campylobacter fetus"

I get this error:

For SRST2, remember to check what separator is being used in this allele database
Traceback (most recent call last):
File "/usr/local/bin/getmlst.py", line 183, in
main()
File "/usr/local/bin/getmlst.py", line 173, in main
m = re.match('>(.)([-])(\d_)',head).groups()
AttributeError: 'NoneType' object has no attribute 'groups'

It seems like everything downloaded though, but cfetus.txt doesn't seem to contain any profiles

root@ac486a86e05b:/home# ls
C_fe_tkt.tfa Campylobacter_fetus.fasta Cfe_aspA.tfa Cfe_glnA.tfa Cfe_gltA.tfa Cfe_glyA.tfa Cfe_pgm.tfa Cfe_uncA.tfa cfetus.txt mlst_data_download_Campylobacter_fetus_2014-11-29.log

root@ac486a86e05b:/home# cat cfetus.txt
Scheme 9 is not available.

This probably isn't something that can be fixed on the SRST2 side since this species doesn't even have any profiles, but it would be nice to have a more meaningful error message when encountering this error mode.

Crash when trying to create novel allele consensus fasta from ARG database

Howdy all,

Quick question, I am trying to recover the novel allele sequences from some samples being run against the ARG database included with SRST2. I'm getting the following error when trying to create the allele pileup:

06/17/2015 12:44:06 Generate pileup...
06/17/2015 12:44:06 Running: samtools mpileup -L 1000 -f /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta -Q 20 -q 1 -B data_out/KPN166ec__KPN166ec.ARGannot.sorted.bam
[mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
06/17/2015 12:44:11 Processing SAMtools pileup...
06/17/2015 12:44:34 Scoring alleles...
Traceback (most recent call last):
File "/share/apps/srst2/bin/srst2.py", line 1592, in
main()
File "/share/apps/srst2/bin/srst2.py", line 1548, in main
db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
File "/share/apps/srst2/bin/srst2.py", line 1102, in run_srst2
db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
File "/share/apps/srst2/bin/srst2.py", line 1164, in process_fasta_db
unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
File "/share/apps/srst2/bin/srst2.py", line 1275, in map_fileSet_to_db
unique_gene_symbols, unique_allele_symbols, pileup_file)
File "/share/apps/srst2/bin/srst2.py", line 859, in parse_scores
allele_pileup_file = create_allele_pileup(top_allele, pileup_file) # XXX Creates a new pileup file for that allele. Not currently cleaned up
File "/share/apps/srst2/bin/srst2.py", line 765, in create_allele_pileup
with open(outpileup, 'w') as allele_pileup:
IOError: [Errno 2] No such file or directory: '77__OqxA_Flq__OqxA__1044.data_out/KPN166ec__KPN166ec.ARGannot.pileup'

So it seems like I'm having a problem creating a file / directory but I'm not certain why that is... the script is able to generate all the other output files it needs to... this is only an error when I try to run with --report_new_consensus. Any help is appreciated. This is happening for every paired end readset I attempt to generate the --report_new_consensus reports for.

Full program output is appended below, in case it is helpful.

06/17/2015 12:39:54 program started
06/17/2015 12:39:54 command line: /share/apps/srst2/bin/srst2.py --input_pe data_in/KPN166ec_1.fastq.gz data_in/KPN166ec_2.fastq.gz --report_new_consensus --gene_db /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta --output data_out/KPN166ec
06/17/2015 12:39:54 Total paired readsets found:1
06/17/2015 12:39:54 Index for /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta is already built...
06/17/2015 12:39:54 Processing database /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta
06/17/2015 12:39:54 Non-unique:55, CmlA_PheCmlA5
06/17/2015 12:39:54 Non-unique:114, CfrA_Phe
06/17/2015 12:39:54 Processing sample KPN166ec
06/17/2015 12:39:54 Output prefix set to: data_out/KPN166ec__KPN166ec.ARGannot
06/17/2015 12:39:54 Aligning reads to index /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta using bowtie2...
06/17/2015 12:39:54 Running: bowtie2 -1 data_in/KPN166ec_1.fastq.gz -2 data_in/KPN166ec_2.fastq.gz -S data_out/KPN166ec__KPN166ec.ARGannot.sam -q --very-sensitive-local --no-unal -a -x /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta
1764353 reads; of these:
1764353 (100.00%) were paired; of these:
1760384 (99.78%) aligned concordantly 0 times
2397 (0.14%) aligned concordantly exactly 1 time
1572 (0.09%) aligned concordantly >1 times
----
1760384 pairs aligned concordantly 0 times; of these:
156 (0.01%) aligned discordantly 1 time
----
1760228 pairs aligned 0 times concordantly or discordantly; of these:
3520456 mates make up the pairs; of these:
3517876 (99.93%) aligned 0 times
1544 (0.04%) aligned exactly 1 time
1036 (0.03%) aligned >1 times
0.31% overall alignment rate
06/17/2015 12:43:47 Processing Bowtie2 output with SAMtools...
06/17/2015 12:43:47 Generate and sort BAM file...
06/17/2015 12:43:47 Running: samtools view -b -o data_out/KPN166ec__KPN166ec.ARGannot.unsorted.bam -q 1 -S data_out/KPN166ec__KPN166ec.ARGannot.sam.mod
[samopen] SAM header is present: 1683 sequences.
06/17/2015 12:43:50 Running: samtools sort data_out/KPN166ec__KPN166ec.ARGannot.unsorted.bam data_out/KPN166ec__KPN166ec.ARGannot.sorted
06/17/2015 12:44:06 Deleting sam and bam files that are not longer needed...
06/17/2015 12:44:06 Deleting data_out/KPN166ec__KPN166ec.ARGannot.sam
06/17/2015 12:44:06 Deleting data_out/KPN166ec__KPN166ec.ARGannot.sam.mod
06/17/2015 12:44:06 Deleting data_out/KPN166ec__KPN166ec.ARGannot.unsorted.bam
06/17/2015 12:44:06 Generate pileup...
06/17/2015 12:44:06 Running: samtools mpileup -L 1000 -f /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta -Q 20 -q 1 -B data_out/KPN166ec__KPN166ec.ARGannot.sorted.bam
[mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
06/17/2015 12:44:11 Processing SAMtools pileup...
06/17/2015 12:44:34 Scoring alleles...
Traceback (most recent call last):
File "/share/apps/srst2/bin/srst2.py", line 1592, in
main()
File "/share/apps/srst2/bin/srst2.py", line 1548, in main
db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
File "/share/apps/srst2/bin/srst2.py", line 1102, in run_srst2
db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
File "/share/apps/srst2/bin/srst2.py", line 1164, in process_fasta_db
unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
File "/share/apps/srst2/bin/srst2.py", line 1275, in map_fileSet_to_db
unique_gene_symbols, unique_allele_symbols, pileup_file)
File "/share/apps/srst2/bin/srst2.py", line 859, in parse_scores
allele_pileup_file = create_allele_pileup(top_allele, pileup_file) # XXX Creates a new pileup file for that allele. Not currently cleaned up
File "/share/apps/srst2/bin/srst2.py", line 765, in create_allele_pileup
with open(outpileup, 'w') as allele_pileup:
IOError: [Errno 2] No such file or directory: '77__OqxA_Flq__OqxA__1044.data_out/KPN166ec__KPN166ec.ARGannot.pileup'

add example command and web link to the text printed upon installation

Use a specific version of samtools (and others)

Hi @katholt,

I wanted to test your temperature on a proposed contribution before raising a pull request. I hope you don't mind. To be clear, I'm happy to make a pull request, I just wanted to check that you'd be happy with the principle before I do the work.

We've got a few different versions of samtools installed but the default is 0.1.19; I'd like to use version 0.1.18 with srst2*. There are a few ways I could do this, but the neatest would be to set an environment variable (e.g. SRST2_SAMTOOLS) to point to the preferred binary and then to set this in .bashrc or .bash_profile. What do you think?

I'm happy to make the change myself, if it's something you'd be interested in merging. I think it could possibly be useful to other people as well.

I'd suggest that I make changes to [scripts/srs2.py] and [scripts/slurm_srst2.py] to check if the environment variable is set otherwise default to vanilla 'samtools'. For consistency, I could do something similar for bowtie2-build if you like.

Let me know if this is a contribution you'd be interested in otherwise I'll have a think about something a bit hackier that I could do my end.

Many thanks,

Ben

OK so actually my situation is a bit more complicated in that I'm trying to help my colleagues use this software on a central machine and make it so that for this piece of software they use samtools-0.1.18 but for other bits of software they use a different version, preferably without them needing to notice that they're doing so.

Change default behaviour to throw away pileups & final sorted bam

Add --keep_final_alignment option

if this and/or keep_interim_alignment is switched on, keep final sorted bam & pileups

otherwise delete pileup & bam

The module logging is necessary for slurm_srst2.py

The current version does not import this module, causing an error "global name 'logging' is not defined". A pull request has been submitted to fix this problem.

Miscalling alleles with only indels differences between variants

This will be a long post so the quick summary is that the new version of SRST2 added a new parameter which causes the miss prediction of allele ldh from 1 to 197. It effect only results where the different their is larger in length between different variant of an allele. Parameter in question is called "--mlst_max_mismatch" with default value of 10.

Possible solutions

Allow users to turn off "mlst_max_mismatch"

Details:

A few new parameters were added to SRST2 between the different version from 0.2.0 to 0.2.1 . One of theses parameters is called "--mlst_max_mismatch". Description from the srst2.py --help is the following :

--mlst_max_mismatch MLST_MAX_MISMATCH

Maximum number of mismatches per read for MLST allele calling (default 10)

The parameter will filter reads AFTER they have been mapped with bowtie2 based on the CIGAR string. If it see that it has more then 10 mismatch/indels, it throws out the read from the alignment completely. So another words, pre-filtering data before it starts determining the alleles.

However, this pre-filtering removes data that helps with making the create call in the case with ldh alelle.

Below is a screenshot of the alignment between ldh 1 and ldh 197. They are 100% identical but ldh_1 has an extra 19 base pairs near the end.

Since both variants of the alleles have a length differences of 19 base pairs, all reads aligning to ldh_197 that have the extra 19 base pairs are thrown out but should be kept! Attached is the pileup view of the region in question against ldh_197. Top windows is the old version of srst 0.20 and bottom is 0.2.1. As you can see, almost all reads are thrown out in that region.

So when it comes time to score each allele variants,both ldh_1 and ldh_197 both report having no indels when it 197 variant should have 19!

Crash during processing of large number of compiledResults.txt files into a single output

Running into an error while trying to use --prev_output to aggregate a large number of compiledResults from MLST & Antibiotic resistance gene database into a single report. Sample size is 932 samples. Following error is thrown during the processing phase:

Traceback (most recent call last):
File "/share/apps/srst2/bin/srst2.py", line 1592, in
main()
File "/share/apps/srst2/bin/srst2.py", line 1583, in main
compile_results(args,mlst_results_hashes,gene_result_hashes,compiled_output_file)
File "/share/apps/srst2/bin/srst2.py", line 1439, in compile_results
this_st = mlst_results_master[sample].split("\t")[1]
IndexError: list index out of range

I'm assuming it's something about the sheer number of samples? or perhaps the compiledResults files.

Best regards,
S. Wesley Long

grep gene name syntax error

Hi,

I have this issue at the end of running srst2 using resistance gene database. I don't find any problem with the result but I want to clarify this error with you. Is this error a significant one or it just output error with the gene naming but doesn't affect the result?

sh: -c: line 0: syntax error near unexpected token (' sh: -c: line 0:grep 194__mph(A)_1_D16251__mph(A)_1_D16251__00057 /database/resfinder2016_2.fasta'

Thanks,
Cheng

example.txt is missing the confident ST after line 106

On line 106 of the example.txt file, it says:

This time there is a confident ST called, which should look like this:

But then it doesn't actually have an example of what the confident ST should look like.

default path to srst2 in the slurm_srst2.py script is 'srst2.py' but should be 'srst2'

This has been fixed in the master branch, but not yet in the release.
It will have to wait until the next release.

Allow bowtie2 versions higher than 2.1.0

Currently we just check for 2.1.0, but I have tested and found it works with the more recent higher versions of bowtie2.
So check_command_version function needs to be modified to allow any versions higher than this FOR BOWTIE2 BUT NOT FOR SAMTOOLS as we need to enforce samtools 0.1.18 and not allow 0.1.19

mlst results file only contains header when running certain samples

Here are the commands I ran:

getmlst.py --species "Escherichia coli#1"
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028698/ERR028698_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028698/ERR028698_2.fastq.gz
srst2 --input_pe ERR028698*.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt

This is the output I get:
Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth maxMAF

Note that there is only a header and not even the sample name is printed

Here is the log, which does not indicate any errors:

01/24/2015 19:22:24 program started
01/24/2015 19:22:24 command line: /usr/local/bin/srst2 --input_pe ERR028698_1.fastq.gz ERR028698_2.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txtsrst2 --input_pe ERR028698_1.fastq.gz ERR028698_2.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt
01/24/2015 19:22:24 Total paired readsets found:1
01/24/2015 19:22:24 Building bowtie2 index for Escherichia_coli#1.fasta...
01/24/2015 19:22:24 Running: bowtie2-build Escherichia_coli#1.fasta Escherichia_coli#1.fasta
01/24/2015 19:22:29 Processing database Escherichia_coli#1.fasta
01/24/2015 19:22:29 Running: samtools faidx Escherichia_coli#1.fasta
01/24/2015 19:22:30 Processing sample ERR028698
01/24/2015 19:22:30 Output prefix set to: shigella1__ERR028698.Escherichia_coli#1
01/24/2015 19:22:30 Aligning reads to index Escherichia_coli#1.fasta using bowtie2...
01/24/2015 19:22:30 Running: bowtie2 -1 ERR028698_1.fastq.gz -2 ERR028698_2.fastq.gz -S shigella1__ERR028698.Escherichia_coli#1.sam -q --very-sensitive-local --no-unal -a -x Escherichia_coli#1.fasta
01/24/2015 19:22:32 Processing Bowtie2 output with SAMtools...
01/24/2015 19:22:32 Generate and sort BAM file...
01/24/2015 19:22:32 Running: samtools view -b -o shigella1__ERR028698.Escherichia_coli#1.unsorted.bam -q 1 -S shigella1__ERR028698.Escherichia_coli#1.sam.mod
01/24/2015 19:22:32 Running: samtools sort shigella1__ERR028698.Escherichia_coli#1.unsorted.bam shigella1__ERR028698.Escherichia_coli#1.sorted
01/24/2015 19:22:32 Deleting sam and bam files that are not longer needed...
01/24/2015 19:22:32 Deleting shigella1__ERR028698.Escherichia_coli#1.sam
01/24/2015 19:22:32 Deleting shigella1__ERR028698.Escherichia_coli#1.sam.mod
01/24/2015 19:22:32 Deleting shigella1__ERR028698.Escherichia_coli#1.unsorted.bam
01/24/2015 19:22:32 Generate pileup...
01/24/2015 19:22:32 Running: samtools mpileup -L 1000 -f Escherichia_coli#1.fasta -Q 20 -q 1 shigella1__ERR028698.Escherichia_coli#1.sorted.bam
01/24/2015 19:22:32 Processing SAMtools pileup...
01/24/2015 19:22:33 Scoring alleles...
01/24/2015 19:22:39 Finished processing for read set ERR028698 ...
01/24/2015 19:22:39 Finished processing for database Escherichia_coli#1.fasta ...
01/24/2015 19:22:39 MLST output printed to shigella1__mlst__Escherichia_coli#1__results.txt
01/24/2015 19:22:39 SRST2 has finished.

create_allele_pileup crashes if --output contained a directory path

I ran this command:

"srst2 --input_pe /data/scratch/Bcereus-15/Bcereus-15_1.fastq.gz /data/scratch/Bcereus-15/Bcereus-15_2.fastq.gz --output /data/output/appresults/14091077/Bcereus-15/Bcereus-15 --save_scores --report_all_consensus --mlst_db /opt/gene_databases/mlst/11302014/Bacillus_cereus/Bacillus_cereus.fasta --mlst_definitions /opt/gene_databases/mlst/11302014/Bacillus_cereus/bcereus.txt --gene_db /opt/srst2/data/ARGannot.fasta /opt/srst2/data/PlasmidFinder.fasta /opt/gene_databases/vfdb/11302014/Bacillus/Bacillus_VF_clustered.fasta"

Note that --output contains a directory in the prefix because I want to write the outputs to that folder instead of the cwd. This apparently causes problems if you turn on --report_all_consensus because create_allele_pileup tries to just concatenate the allele name to the front of the directory name:

def create_allele_pileup(allele_name, all_pileup_file):
outpileup = allele_name + "." + all_pileup_file

which results in this error:

12/05/2014 08:11:09 Processing SAMtools pileup...
12/05/2014 08:11:14 Scoring alleles...
Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.4', 'console_scripts', 'srst2')()
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1508, in main
mlst_report, mlst_results = run_srst2(args,fileSets,args.mlst_db,"mlst")
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1074, in run_srst2
db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1136, in process_fasta_db
unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1247, in map_fileSet_to_db
unique_gene_symbols, unique_allele_symbols, pileup_file)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 831, in parse_scores
allele_pileup_file = create_allele_pileup(top_allele, pileup_file) # XXX Creates a new pileup file for that allele. Not currently cleaned up
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 737, in create_allele_pileup
with open(outpileup, 'w') as allele_pileup:
IOError: [Errno 2] No such file or directory: 'pta_102./data/output/appresults/14091077/Bcereus-15/Bcereus-15__Bcereus-15.Bacillus_cereus.pileup'

One possible solution is to use the os.path module to split the all_pileup_file string into the directory and filename, concatenate allele_name to the filename, and the rejoin the directory with the new filename.

Analyzing NextSEq data

Hi Kat,

I got NextSeq data which contains 4X forward fastq files and 4X reverse fastq files, in total 8 files for each sample. I concatenated 4 forward fastq files and 4 reverse fastq files to one forward and reverse fastq files, and run SRST2 but it seems like not processing the run.

Please tell me how to handle such multiple fastq files for SRST2.

Best regards,
Tomi

Add 'How it works' section to doc

Allow forward and reverse flags as file name

Hi SRST2 team,

Your tool is amazing, very neat and fast.
I'm currently using SRST2 for my illumina data, paired end data.

I've files named as (example): 111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz These were not accepted as
--input_pe ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005*.fastq.gz
This was probably, as the naming standardized in srst2.py is different.

I was able to run as:
python srst2.py --input_pe ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001.fastq.gz --forward 111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001 --reverse 111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001

This (above case) was handy, since I'm playing around with tool, and data and can provide prefixes (manually). I'm afraid, things might go wrong, when working on ~1000 isolates, in providing prefixes as in above fashion.

Can something be done in script to overcome/ignore this? This seems redundant; provide paired end flag, read1 file name, read2 file name and prefixes for forward and reverse reads.

Something Like:
python srst2.py --input_pe --forward ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz --reverse ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001.fastq.gz

katholt / srst2 Goto Github PK

srst2's Introduction

SRST2

Contents

Current release - v0.2.0 - July 28, 2016

Updates in current master branch (not yet in a release)

Installation

1 - Install dependencies

2 - Get and install the code

3 - Test that the programs are installed properly

Basic usage - MLST

1 - Gather your input files

2 - Run MLST

3 - Check the outputs

Basic usage - Resistance genes

1 - Gather your input files

2 - Run gene detection

3 - Check the outputs

All usage options

Input read formats and options

Read formats

Read names

Paired reads

Sample names

If you have multiple read sets for the same sample

MLST Database format

Allele sequences file, fasta format.

MLST definitions file, tab delimited format.

Gene databases

Output files

MLST results

Gene typing

Combined results

Mapping results

Printing consensus sequences

Report consensus sequences for novel alleles

Report consensus sequences for all alleles

Collate consensus sequences output by SRST2 run on multiple strains & loci, into one file per locus

More basic usage examples

Compile results from completed runs

Running lots of jobs and compiling results

Known issues

Generating SRST2-compatible clustered database from raw sequences

Gene database format

Sourcing suitable gene databases

If you already know which alleles belong to the same gene family and should be clustered for reporting:

Clustering sequences

Screening for resistance genes with SRST2

Using the VFDB Virulence Factor Database with SRST2

Preformatted databases for specialist typing with SRST2

Using the EcOH database for serotyping E. coli with SRST2

Basic Usage – Serotyping E. coli

Example

Interpretation

Using the GBS-SBG database for serotyping Streptococcus agalactiae (Group B Streptococcus) with SRST2

Basic Usage - Serotyping GBS

Example

Interpretation

Typing the LEE pathogenicity island of E. coli

Basic Usage – Typing E. coli LEE variants

Example

Interpretation

Plotting output in R

srst2's People

Contributors

Stargazers

Watchers

Forkers

srst2's Issues

Error in step: Using the VFDB Virulence Factor Database with SRST2

1: The URL to the VFDB has changed and, using wget, should now be:

2: We need three extra steps for the database parsing scripts to work.

Recommend Projects

Recommend Topics

Recommend Org