sanger-pathogens / bio-tradis Goto Github PK

A set of tools to analyse the output from TraDIS analyses

Home Page: https://sanger-pathogens.github.io/Bio-Tradis/

License: Other

Perl 92.42% Gnuplot 0.49% R 4.31% Shell 1.58% Dockerfile 1.19%

genomics sequencing next-generation-sequencing research bioinformatics bioinformatics-pipeline global-health infectious-diseases pathogen

bio-tradis's Introduction

Bio-Tradis

A set of tools to analyse the output from TraDIS analyses

Introduction
Installation
Usage
License
Feedback/Issues
Citation

Introduction

The Bio::TraDIS pipeline provides software utilities for the processing, mapping, and analysis of transposon insertion sequencing data. The pipeline was designed with the data from the TraDIS sequencing protocol in mind, but should work with a variety of transposon insertion sequencing protocols as long as they produce data in the expected format.

For more information on the TraDIS method, see http://bioinformatics.oxfordjournals.org/content/32/7/1109 and http://genome.cshlp.org/content/19/12/2308.

Installation

Bio-Tradis has the following dependencies:

Required dependencies

bwa
smalt
samtools
tabix
R
Bioconductor

There are a number of ways to install Bio-Tradis and details are provided below. If you encounter an issue when installing Bio-Tradis please contact your local system administrator. If you encounter a bug please log it here or email us at [email protected].

Bioconda

Install conda and enable the bioconda channel.

conda install -c bioconda biotradis=1.4.5=1

Docker

Bio-Tradis can be run in a Docker container. First install Docker, then install Bio-Tradis:

docker pull sangerpathogens/bio-tradis

To use Bio-Tradis use a command like this (substituting in your directories), where your files are assumed to be stored in /home/ubuntu/data:

docker run --rm -it -v /home/ubuntu/data:/data sangerpathogens/bio-tradis bacteria_tradis -h

Running the tests

The test can be run with dzil from the top level directory:

dzil test

Usage

For command-line usage instructions, please see the tutorial in the file "BioTraDISTutorial.pdf". Note that default parameters are for comparative experiments, and will need to be modified for gene essentiality studies.

Bio-Tradis provides functionality to:

detect TraDIS tags in a BAM file
add the tags to the reads
filter reads in a FastQ file containing a user defined tag
remove tags
map to a reference genome
create an insertion site plot file

The functions are avalable as standalone scripts or as perl modules.

Scripts

Executable scripts to carry out most of the listed functions are available in the bin:

check_tradis_tags - Prints 1 if tags are present, prints 0 if not.
add_tradis_tags - Generates a BAM file with tags added to read strings.
filter_tradis_tags - Create a fastq file containing reads that match the supplied tag
remove_tradis_tags - Creates a fastq file containing reads with the supplied tag removed from the sequences
tradis_plot - Creates an gzipped insertion site plot
bacteria_tradis - Runs complete analysis, starting with a fastq file and produces mapped BAM files and plot files for each file in the given file list and a statistical summary of all files. Note that the -f option expects a text file containing a list of fastq files, one per line. This script can be run with or without supplying tags.

A help menu for each script can be accessed by running the script with no parameters.

Analysis Scripts

Three scripts are provided to perform basic analysis of TraDIS results in bin:

tradis_gene_insert_sites - Takes genome annotation in embl format along with plot files produced by bacteria_tradis and generates tab-delimited files containing gene-wise annotations of insert sites and read counts.
tradis_essentiality.R - Takes a single tab-delimited file from tradis_gene_insert_sites to produce calls of gene essentiality. Also produces a number of diagnostic plots.
tradis_comparison.R - Takes tab files to compare two growth conditions using edgeR. This analysis requires experimental replicates.

Internal Objects and Methods

Bio::Tradis::DetectTags

Required parameters:
- bamfile - path to/name of file to check
Methods:
- tags_present - returns true if TraDIS tags are detected in bamfile

Bio::Tradis::AddTagsToSeq

Required parameters:
- bamfile - path to/name of file containing reads and tags
Optional parameters:
- outfile - defaults to file.tr.bam for an input file named file.bam
Methods:
- add_tags_to_seq - add TraDIS tags to reads. For unmapped reads, the tag is added to the start of the read sequence and quality strings. For reads where the flag indicates that it is mapped and reverse complemented, the reverse complemented tags are added to the end of the read strings. This is because many conversion tools (e.g. picard) takes the read orientation into account and will re-reverse the mapped/rev comp reads during conversion, leaving all tags in the correct orientation at the start of the sequences in the resulting FastQ file.

Bio::Tradis::FilterTags

Required parameters:
- fastqfile - path to/name of file to filter. This may be a gzipped fastq file, in which case a temporary unzipped version is used and removed on completion.
- tag - TraDIS tag to match
Optional parameters:
- mismatch - number of mismatches to allow when matching the tag. Default = 0
- outfile - defaults to file.tag.fastq for an input file named file.fastq
Methods:
- filter_tags - output all reads containing the tag to outfile

Bio::Tradis::RemoveTags

Required parameters:
- fastqfile - path to/name of file to filter.
- tag - TraDIS tag to remove
Optional parameters:
- mismatch - number of mismatches to allow when removing the tag. Default = 0
- outfile - defaults to file.rmtag.fastq for and input file named file.fastq
Methods:
- remove_tags - output all reads with the tags removed from both sequence and quality strings to outfile

Bio::Tradis::Map

Required parameters:
- fastqfile - path to/name of file to map to the reference
- reference - path to/name of reference genome in fasta format (.fa)
Optional parameters:
- refname - name to assign to the reference index files. Default = ref.index
- outfile - name to assign the mapped SAM file. Default = mapped.sam
Methods:
- index_ref - create index files of the reference genome. These are required for the mapping step. Only skip this step if index files already exist. If SMALT is used as the aligner -sk and -ss options for referencing are calculated based on the length of the reads being mapped:
  - <70 : -sk 13 -ss 4
  - 70 & <100 : -sk 13 -ss 6
  - 100 : -sk 20 -ss 13
- do_mapping - map fastqfile to reference. Options used for mapping are: -k the min seed length for BWA -s (for using SMALT as alternative aligner) -r -1, -x and -y 0.96 for SMALT (see SMALT manual)
For more information on the mapping and indexing options discussed here, see the BWA manual (http://rothlab.ucdavis.edu/howto/attachments/bwa_manpage.pdf) and/or SMALT manual (ftp://ftp.sanger.ac.uk/pub4/resources/software/smalt/smalt-manual-0.7.4.pdf)

Bio::Tradis::TradisPlot

Required parameters:
- mappedfile - mapped and sorted BAM file
Optional parameters:
- outfile - base name to assign to the resulting insertion site plot. Default = tradis.plot
- mapping_score - cutoff value for mapping score. Default = 30
Methods:
- plot - create insertion site plots for reads in mappedfile. This file will be readable by the Artemis genome browser (http://www.sanger.ac.uk/resources/software/artemis/)

Bio::Tradis::RunTradis

Required parameters:
- fastqfile - file containing a list of fastqs (gzipped or raw) to run the complete analysis on. This includes all (including intermediary format conversion and sorting) steps starting from filtering and, finally, producing an insertion site plot and a statistical summary of the analysis.
- tag - TraDIS tag to filter for and then remove
- reference - path to/name of reference genome in fasta format (.fa)
Optional parameters:
- mismatch - number of mismatches to allow when filtering/removing the tag. Default = 0
- tagdirection - direction of the tag, 5' or 3'. Default = 3
- mapping_score - cutoff value for mapping score. Default = 30
Methods:
- run_tradis - run complete analysis

Perl Programming Examples

You can reuse the Perl modules as part of other Perl scripts. This section provides example Perl code. Check whether file.bam contains TraDIS tag fields and, if so, adds the tags to the reads' sequence and quality strings.

my $detector = Bio::Tradis::DetectTags(bamfile => 'file.bam');
if($detector->tags_present){
	Bio::Tradis::AddTagsToSeq(bamfile => 'file.bam', outfile => 'tradis.bam')->add_tags_to_seq;
}

Filter a FastQ file with TraDIS tags attached for those matching the given tag. Then, remove the same tag from the start of all sequences in preparation for mapping.

Bio::Tradis::FilterTags(
	fastqfile => 'tradis.fastq',
	tag => 'TAAGAGTGAC', 
	outfile => 'filtered.fastq'
)->filter_tags;
Bio::Tradis::RemoveTags(
	fastqfile => 'filtered.fastq',
	tag => 'TAAGAGTGAC', 
	outfile => 'notags.fastq'
)->remove_tags;

Create mapping object, index the given reference file and then map the fastq file to the reference. This will produce index files for the reference and a mapped SAM file named tradis_mapped.sam.

my $mapping = Bio::Tradis::Map(
	fastqfile => 'notags.fastq', 
	reference => 'path/to/reference.fa', 
	outfile => 'tradis_mapped.sam'
);
$mapping->index_ref;
$mapping->do_mapping;

Generate insertion site plot for only reads with a mapping score >= 50

Bio::Tradis::TradisPlot(mappedfile => 'mapped.bam', mapping_score => 50)->plot;

Run complete analysis on fastq files listed in file.list. This includes filtering and removing the tags allowing one mismatch to the given tag, mapping, BAM sorting and creation of an insertion site plot and stats file for each file listed in file.list.

Bio::Tradis::RunTradis(
	fastqfile => 'file.list', 
	tag => 'GTTGAGGCCA', 
	reference => 'path/to/reference.fa', 
	mismatch => 1
)->run_tradis;

License

Bio-Tradis is free software, licensed under GPLv3.

Feedback/Issues

We currently do not have the resources to provide support for Bio-Tradis. However, the community might be able to help you out if you report any issues about usage of the software to the issues page.

Citation

If you use this software please cite:

"The TraDIS toolkit: sequencing and analysis for dense transposon mutant libraries", Barquist L, Mayho M, Cummins C, Cain AK, Boinett CJ, Page AJ, Langridge G, Quail MA, Keane JA, Parkhill J. Bioinformatics. 2016 Apr 1;32(7):1109-11. doi: 10.1093/bioinformatics/btw022. Epub 2016 Jan 21.

bio-tradis's People

Contributors

Stargazers

Watchers

bio-tradis's Issues

bacteria_tradis tsv extension bug

Hello

I found a small bug, if I provide the input file ending in the extension ".tsv" to bacteria_tradis it runs without complain but it reports zero reads mapped. However, it works if I provide the same file but ended in ".txt".

Thanks for a great tool!

ftp download link in biotradistutorial.pdf no longer exists

On page three of "Using the Bio::TraDIS pipeline" pdf file the user is directed to download a .bam file using curl. The ftp link provided no longer exists, but the http link to the file in the ebi website still works correctly.

The incorrect line is:

"curl ftp://ftp.sra.ebi.ac.uk/vol1/ERA320/ERA320707/bam/12418_1%2310.bam > 12418_1#10.bam"

It should be:

"curl ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR541/ERR541286/12418_1%2310.bam > 12418_1#10.bam"

Installing Bio-Tradis Fail

I am using linuxbrew to install Bio-Tradis. I've ran brew doctor and brew update, followed by brew install of r, smalt, samtools and cpanm Everything went through smoothly.

However, when I ran sudo cpanm -f Bio::Tradis, I received this message:
--> Working on Bio::Tradis
Fetching http://www.cpan.org/authors/id/A/AJ/AJPAGE/Bio-Tradis-1.3.3.tar.gz ... OK
Configuring Bio-Tradis-1.3.3 ... OK
Building and testing Bio-Tradis-1.3.3 ... FAIL
! Testing Bio-Tradis-1.3.3 failed but installing it anyway.

I don't have any additional error message to troubleshoot. Thoughts on how to get Bio-Tradis to run?

how is the insertion index calculated?

Issues on running BioTraDIS on multiple contigs

Hi,

We've been running BioTraDIS on a .embl file which contains a chromosome and two plasmids (GCA_000008865.2.txt). On our data the command bacteria_tradis works fine and as would be expected this produces three .insertion_site_plot.gz each one corresponding to either the chromosome or plasmids (SK-1-1-2-5.ENA_AB011548_AB011548.2.insert_site_plot.gz, SK-1-1-2-5.ENA_AB011549_AB011549.2.insert_site_plot.gz, SK-1-1-2-5.ENA_BA000007_BA000007.3.insert_site_plot.gz). Its important to note that each of these files have the same amount of lines each corresponding to the length of the particular contig in bp. However when we come to run tradis_gene_insert_sites for each insert_site_plot file we start to encounter a couple of issues within the tradis_gene_insert_site.csv generated.

An example of the tradis_gene_insert_sites generated files are here (trimmed_1-1-2-5.fq.ENA_AB011549_AB011549.2.tradis_gene_insert_sites.csv, trimmed_1-1-2-5.fq.ENA_BA000007_BA000007.3.tradis_gene_insert_sites.csv, trimmed_1-1-2-5.fq.ENA_AB011548_AB011548.2.tradis_gene_insert_sites.csv). Where BA000007 is the chromosome and AB011548 + AB011549 are the plasmids.

Issue 1
The first issue that we have encountered is that in the annotations which do not correspond to the particular chormasone or plasmid ran there is data such as read count and insertion indices being generated for some annotations. This is particularly noted in our plasmid files (denoted by the AB) where we see a read count being generated for genes which are present on the chromosome and the other plasmid, which shouldn't be happening. My guess is that the annotation for each contig is being overlayed over the insert_site_plot file creating entries for each contig up to the length of the insert_site_plot file. Our assumption is to ignore the annotations for the other contigs and set these back to 0. Is there anyway to prevent this ?

Issue 2

Secondly, we've noted another issues in regards to annotations where the genomic start and genomic end of a feature span the beginning and end of a DNA sequence. An example of this can be found here in the gene tagA.

locus_tag	gene_name	start	end	strand	read_count	ins_index	gene_length	ins_count	fcn
AB011549_1_92527_2502	tagA	92527	2502	1	0	0	-90024	0	ToxR-regulated lipoprotein
AB011549_1_2589_3464	etpC	2589	3464	1	2954	0.277397	876	243	Type II secretion pathway related protein
AB011549_1_3675_5432	etpD	3675	5432	1	7430	0.261092	1758	459	Type II secretion pathway related protein

Here tagA spans the start of the plasmid sequence and really should have a gene length of approximately 2762bp, however generates a negative gene length. In addition because of this no data entered for the gene in question. Is there anyway to solve this?

Thanks for you help

Mat

Mapping score always set to 1 if -m option used with plot_tradis

Using the plot_tradis command line tool, the minimum mapping quality is always set to 1 if the option -m is used. This means that genes in repeat regions are incorrectly identified as essential.

Port 18 volume discrepancy?

For the 10 cycle transposon chemistry, C1 ("Custom 1 Primer Mix") has 300 µL and then 75 µL aspirated for the first sequencing read:

https://github.com/sanger-pathogens/Bio-Tradis/blob/master/recipes/Transposon10/Chemistry/Chemistry.xml#L303-L304

Another 300 µL + 75 µL are aspirated for the first index read:

https://github.com/sanger-pathogens/Bio-Tradis/blob/master/recipes/Transposon10/Chemistry/Chemistry.xml#L326-L327

Yet the protocol in the supplement of the manuscript describes a loading of only 608 µL total into port 18 on a MiSeq:

Am I misunderstanding the sequencing and chemistry protocol, or would this not lead to insufficient volume of C1 being loaded for the run (750 µL > 608 µL)? Is there a typo in the sequencing recipe, or should we just make up say 800 µL of HT1 + primer for C1 / port 18?

Thank you!

Error in step 3.5 of bacteria_tradis, using samtools 1.3

When using samtools 1.3, there is an error in step 3.5 of the bacteria_tradis pipeline. The error is:

..........Step 3.5: Convert output from SAM to BAM and sort
[bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files

This seems to be due to changes in the samtools options (https://github.com/samtools/samtools/releases/tag/1.3):

obsolete samtools sort in.bam out.prefix usage has been removed. If you are still using ‑f, ‑o, or out.prefix, convert to use -T PREFIX and/or -o FILE instead. (#295, #349, #356, #418, PR #441; see also discussions in #171, #213.)

Problems with bacteria_tradis mapping

Hi,

I have two libraries that I am having problems with the bacteria_tradis mapping for. I have used cutadapt to trim the transposon (and adapters) from the reads and selected only the reads that contained the transposon to go to the output folder. Reads were also filtered to discard anything shorter than 15bp.

I have tried to align these reads using both BWA and smalt (using tagless mode for both). Using BWA with default parameters, I’m getting 100% matched reads for every sample, but about half have 0% mapping. The other half have varying levels of mapping, and all of these samples have at least twice as many unique insertion sites as I am expecting.

Using smalt with -m 10 (all other parameters at default), I am getting 100% matched reads but very low levels of mapping, < 5% for every sample. The number of UIS is still higher than I would expect for one of the libraries, and I have very low coverage (average around 1.5 reads per UIS).

For both BWA and smalt I have seen that the number of UIS seems to be proportional to the number of reads rather than the number of mutants in the libraries.

Do you have any ideas of things I could try for either mapping tool? Thanks!

tradis_comparison issue

on a new machine, fresh conda install:

Any thoughts on how to rectify? Thanks

phil@BA085063:~$ tradis_comparison.R -h
Loading required package: edgeR
Loading required package: limma
Error: package or namespace load failed for ‘edgeR’:
package ‘edgeR’ was installed by an R version with different internals; it needs to be reinstalled for use with this R version
Warning message:
package ‘edgeR’ is not available (for R version 3.6.1)

installing biotradis on MacOS

Hi there,
I am having issues installing Bio-Tradis using conda on MacOS. I have added the channels. Am I missing something?
Thanks!

PackagesNotFoundError: The following packages are not available from current channels:

  - biotradis

Current channels:

  - https://conda.anaconda.org/bioconda/osx-64
  - https://conda.anaconda.org/bioconda/noarch
  - https://conda.anaconda.org/conda-forge/osx-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/free/osx-64
  - https://repo.anaconda.com/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch
  - https://conda.anaconda.org/r/osx-64
  - https://conda.anaconda.org/r/noarch
```

`tradis_merge_plots` cannot find `tradisfind`

When using the Docker image within Singularity, invoking tradis_merge_plots produces the error:

sh: 1: tradisfind: not found

This is not surprising as I cannot find an instance of tradisfind within the Bio-Tradis repository.

Is it possible that this tool was mistakenly left behind within your dev/execution environment?

EdgeR not working with new R version?

Not sure if anyone else is seeing this or if this is fixable from the BioTraDIS side, but whenever I try to run the tradis_comparison.R script it has issues with not being able to install or use the package edgeR. I'm trying to run it on a previous version of R but wondered if anyone else had seen this or has a fix?

Essentiality function not working

Hi,

I have issues with the essentiality function not returning any essential genes due to not finding an intersection point. I attach the histogram of the sample and you can see that there are 2 peaks, but the first is not recognised. I had this issue with several samples. Could you suggest a solution for this?
Regards and thanks for your help.
Sarah

Problem with insertion plots (files longer than they should be)

Hi,
I wanted to raise another issue with insertion plots and suggest a possible solution for it. In issue #86 (closed), several users mentioned that their insertion plots would not open in Artemis. Although this was not considered at the time to be a problem with the biotradis code, I believe it is. As pointed out by others, these insertion plots are occasionally longer than the reference genome. I believe the problem originates with how chimeric alignments of reads covering both the end and the beginning of the circular genome are treated. I think these reads are not seen as being chimeric and instead, padding is added (to the beginning of the genome in my case) to cover for positions that are not matched. This creates extra positions and results in insertion plots that are longer than the reference sequence.

To give you a more specific example, we are looking at a genome with length 4349904 bases. You'd expect the insertion plot to have that many rows but instead it has 4349956 rows. If you examine the end region of the genome and the reads that are aligned there, you will see that nearly all align all the way to the end but in reality they have been chopped up: 52 bases align at the end of the genome and 64 bases align at the start. When your module InsertSite.pm adds padding, it seems to incorrectly assume that the bits of the reads that align at the start should have a padding of 52 bases following them, when in reality, these parts are already aligned and present at the end of the genome.
Here is an example read that's been split between the start and the end of the genome in the bam file (note the SA:Z:... tag indicating the chimeric alignment) - first is the bit aligned at the start, following by the bit aligned at the end of the genome:

Here are what the reads look at the end of the genome (the insertion plot at the top is from biotradis but I have chopped off the first 52 nucleotides to allow it to show on Artemis - as you see, once this padding is removed, the insertions align well with the reads from the bam file); the read in the red rectangle is the same read shown above:

When fixing this, you may want to consider as well how to fix the insertion at the start of the genome as the call of a site there is wrong. All reads mapping right at the start actually are split alignments carrying over from the end of the genome so no insertion site should be called.

Thanks again for your help.
Irilenia

cpan update?

Hi,

I've just noticed the version of tradis_essentiality.R we're distributing via github and cpan are out of synch -- it looks like it's still the old version on cpan. So, anyone following our installation recommendations will be getting an out of date version. Would it be possible to synch the two?

Thanks,
Lars

Possible bug handling reads with clipping when converting mapped reads to insertion plot

I recently installed biotradis using conda

Name Version Build Channel
biotradis 1.4.5 hdfd78af_5 bioconda

I noticed that I had an issue with reads with clipping (which I think you've seen before) where reads with soft clipping at the 5' end take the first mapped nucleotide as the insertion site, resulting in an artificially inflated "unique insertions" count. I can see the fix in the Cigar.pm script,

but found that this fix wasn't in the conda-installed hdfd78af_5 build.
I tried a few conda envs with alternate versions with no success, but possibly some user error on my part somewhere, so in any case I downloaded Bio-Tradis-master directly from github to test again.

This time with the correct Cigar.pm, I have a different error. When I view the plot file in artemis, it looks like all insertion sites have been artificially shifted by a consistent factor, while the bam file viewed in IGV is unchanged
I think (but I'm not familiar enough with perl to be sure I've understood the code correctly) the issue is I have reads mapping across the annotation origin, such that their cigar is something like "77H32M". Because these are the first reads encountered, they seem to offset the rest of the data by "77". I wondered if the issue is here

$current_coordinate += $number;

In any case, I wondered if there might be an option to exclude all reads with clipping (hard and soft)? I know it's very conservative, but even when I don't have the issue with annotation-origin-spanning reads, I found that the following

                                $current_coordinate -= $number if($results{start} == 0);
                                $results{start} = $current_coordinate if($results{start} == 0);
				$current_coordinate += $number;
                                $results{end} = $current_coordinate -1 if($results{end} < $current_coordinate);

reported false insertion sites for "chimeric reads". As in, if a read had the cigar 40S71M, the insertion coordinate would be adjusted by 40 to give a tn-insertion event 40bp upstream, yet when I look at the individual read it's not a real transposon-junction, and I'm still getting artificially inflated insertion reporting. (This particular library is intentionally not very diverse, so doesn't need much sequencing coverage, but as a consequence is more susceptible to "noise")

I hope this makes sense, I'd be happy to try and explain some more. Maybe this is just specific to my data(?) but posting here just in case. Thanks

Essentiality analysis code

Hello,

I was wondering about the piece of code in the beginning of tradis_essentiality.R:

h <- hist(ii, breaks=200,plot=FALSE)
maxindex <- which.max(h$density[10:length(h$density)])
maxval <- h$mids[maxindex+3]

I understand why do we skip first 9 bins (even though in my distribution it's too much) - but why do we offset it by 3 when looking for max value?
Seems insignificant, but I've tried changing these numbers, and the number of genes deemed essential is changing by 100 sometimes (700 vs 800 - the library is not very dense, about 100k unique insertions).

Thank you

-- Alex

tradis_comparison.R error

Hello, I keep getting the "error unexpected input in _". I don't see an underscore at the beginning of an argument or option and I've tried surrounding the names with backticks but to no avail.

danielmediati$ tradis_comparison.R -o outputfile.csv -p outputplot.pdf --controls controlsLB.txt --conditions conditionsM9.txt
Loading required package: limma
Read 2 items
Read 2 items
Error: unexpected input in "_"
Execution halted

Im concerned because the script is able to generate and produce both output files, that being the "output file.csv" and the "output plot.pdf", are these files accurately constructed despite the error occurring? Has this issue happened to anyone else and if so how was this resolved?

Cheers for any help/advice

tradis_essentiality.R output a file containing change points?

Could tradis_essentiality.R write out a text file with the essential and ambiguous change points. This would save the user from having to manually open each pdf and record the values in a table.

Thank you for your time!

Cheers,
Ben

Using Bio-Tradis to analyze mariner-transposon library

I am currently using Bio-Tradis to analyze several mariner-transposon libraries. The tranposon genome junctions are PCR amplified up using P5-transposon-specific primer and P7-common-hexamer primer. The sequencing of this low-diversity library is done by spiking in with other high-diversity samples, rather than using the dark cycles.

I am following the Tradis protocol running bacteria_tradis, followed by tradis_gene_insert_sites. The transposon tag that I am using is "CCGGGGACTTATCAGCCAACCTGT".

Mariner transposons insert into TA sites in the genome. One issue that I am encountering is that I am getting more insertions (ins_counts) per gene than the number of TA sites present in the gene. What's is going on?

Another issue that I have is that I am having trouble opening the plot files directly in Artemis. Is there another way to visualize this (i.e. using IGV)?

I've attached my fastq.stats file from one Tradis run.
fastq.stats.txt

Unable to install biotradis

I have tried install biotradis through conda using the following command:
conda install -c bioconda biotradis
however it wasn't able to resolve the environment and this error message came up:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://repo.anaconda.com/pkgs/main/linux-64/repodata.json
Elapsed: -

Can someone help me out. thanks

-mm paramater not working as expected

When specifying a tag and setting 0 mismatches (the default) the software finds numerous reads with the tag. However, if we change the number of mismatches e.g. to 1 then no reads are found, we get the message 'None of the input files contained the specified tag'.

Our tag contains wildcard characters as reads start with a sequencing primer so our -t parameter starts with '.*.GG', I wonder if this is causing the issue and perhaps a more informative error message can be displayed if that's the case.

Installation on Ubuntu 17.10 fails

Hey everyone,

I tried setting up a Singularity container for our analysis and tried several things to achieve that:
a.) Conda installation (which doesn't work the way it is mentioned in the GitHub Readme)
b.) Linuxbrew installation (similar issues, missing dependencies)

So instead, I used this here to achieve a Blank Ubuntu 17.10 installation with all the requirements in place:

#Fetch updates
apt-get update 
apt-get install -y r-base r-recommended samtools wget cpanminus
#Fetch smalt and compile it
wget https://netcologne.dl.sourceforge.net/project/smalt/smalt-0.7.6.tar.gz
tar xzfv smalt-0.7.6.tar.gz
rm smalt-0.7.6.tar.gz
cd smalt-0.7.6
./configure 
make 
make install

#Install some other stuff for R
Rscript -e 'source("http://bioconductor.org/biocLite.R")' -e 'biocLite(c("edgeR","getopt", "MASS"))'

cpanm -f Bio::Tradis

Things work fine until I try to install Tradis via the very last command and all tools are in place. The problem seems to persist since dependenceis of Tradis are not met, copying the error message in here for that purpose:

--> Working on Task::Weaken
Fetching http://www.cpan.org/authors/id/E/ET/ETHER/Task-Weaken-1.05.tar.gz ... OK
Configuring Task-Weaken-1.05 ... OK
Building and testing Task-Weaken-1.05 ... OK
Successfully installed Task-Weaken-1.05
Building and testing PPI-1.236 ... OK
Successfully installed PPI-1.236
! Installing the dependencies failed: Module 'MooseX::Types::Structured' is not installed
! Bailing out the installation for Parse-Method-Signatures-1.003019.
--> Working on Scope::Upper
Fetching http://www.cpan.org/authors/id/V/VP/VPIT/Scope-Upper-0.30.tar.gz ... OK
Configuring Scope-Upper-0.30 ... OK
Building and testing Scope-Upper-0.30 ... OK
Successfully installed Scope-Upper-0.30
--> Working on Devel::Declare
Fetching http://www.cpan.org/authors/id/E/ET/ETHER/Devel-Declare-0.006019.tar.gz ... OK
Configuring Devel-Declare-0.006019 ... OK
Building and testing Devel-Declare-0.006019 ... OK
Successfully installed Devel-Declare-0.006019
! Installing the dependencies failed: Module 'Parse::Method::Signatures' is not installed
! Bailing out the installation for TryCatch-1.003002.
! Installing the dependencies failed: Module 'TryCatch' is not installed
! Bailing out the installation for Bio-Tradis-1.4.0.
72 distributions installed
ABORT: Aborting with RETVAL=255
Cleaning up...

Any ideas what we could do here?

test failures with samtools v1.10

Hello,

Due to a version parsing bug, in Debian we can no longer build the bio-tradis package as we have upgraded to samtools v1.10. When bypassing the version check in lib/Bio/Tradis/Samtools.pm, the tests no longer pass:

# Looks like you failed 3 tests of 15.
t/Bio/Tradis/AddTagsToSeq.t ................ 
ok 1 - use Bio::Tradis::AddTagsToSeq;
ok 2 - creating object
ok 3 - correctly select the bam output switch
ok 4 - testing output
ok 5 - checking file existence
not ok 6 - checking file contents
ok 7 - creating object
ok 8 - checking file existence
not ok 9 - checking file contents
ok 10 - number of reads as expected
ok 11 - creating object with cram file
ok 12 - correctly select the cram output switch
ok 13 - testing output
ok 14 - checking file existence
not ok 15 - checking file contents
1..15
Dubious, test returned 3 (wstat 768, 0x300)
Failed 3/15 subtests

and

Attribute (_stats_handle) does not pass the type constraint because: Validation failed for 'FileHandle' with value GLOB(0x561d84357a68) at reader Bio::Tradis::CommandLine::TradisAnalysis::_stats_handle (defined at lib/Bio/Tradis/CommandLine/TradisAnalysis.pm line 43) line 15
	Bio::Tradis::CommandLine::TradisAnalysis::_stats_handle('Bio::Tradis::CommandLine::TradisAnalysis=HASH(0x561d84357300)') called at lib/Bio/Tradis/CommandLine/TradisAnalysis.pm line 146
	Bio::Tradis::CommandLine::TradisAnalysis::run('Bio::Tradis::CommandLine::TradisAnalysis=HASH(0x561d84357300)') called at t/Bio/Tradis/CommandLine/TradisAnalysis.t line 33
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 2.
t/Bio/Tradis/CommandLine/TradisAnalysis.t .. 
ok 1 - use Bio::Tradis::CommandLine::TradisAnalysis;
ok 2 - creating object
Dubious, test returned 255 (wstat 65280, 0xff00)
All 2 subtests passed

Can we ignore these failures, or do they require a proper fix from you all?

Thanks, as always.

Duplicate genes

fastq.stats.txt
CP011972-tiled.out.CP011972.tradis_gene_insert_sites.txt

Hi Folks,

I hope all is well in Sanger-land.

We've been working with TraDIS data in Pseudomonas syringae. Especially this beast:
https://www.ebi.ac.uk/ena/data/view/CP011972

Two of the genes stood out to my collaborators HopAM1-1 (IYO_023205) and HopAM1-2 (IYO_008385). These are 100% identical copies of a 830 nucleotide ORF. In our TraDIS experiments no insertions were found in either gene, but they aren't thought to be essential.
I ran a little test to see what happened to insertions in these genes. I simulated a fastq file that tiled 120mers from each position across the gene. I then ran the "bacteria_tradis" and "tradis_gene_insert_sites" pipelines on these. The genes both come back as having 0 insertions and "ins_index" values of zero.

Stats:
File	   	       	    	        hopam.fastq
Total Reads				1424
Reads Matched				1424
% Matched				100
Reads Mapped				1424
% Mapped				100
Unique Insertion Sites : CP011972	0
Seq Len/UIS : CP011972	 		NaN
Total Unique Insertion Sites		0
Total Seq Len/Total UIS			NaN

100% of the reads map to the genome, it looks like SMALT does sensible things here. The failure appears to be in the way Bio-Tradis counts insertion sites. Is it ignoring multi-mapped reads, or perhaps the parser doesn't recognise them.

I've since scaled this test up to tile reads across the entire genome, then identify the which genes/regions can actually be probed by your pipeline. Please excuse my awful code, but it gets the job done.

Generate a fastq that tiles 120mer reads across the entire genome (forward and reverse). Warning: generates a 5G file.

cat CP011972.2.fasta      | grep -v '^>' | tr -d "\n" | tr "a-z" "A-Z" | perl -lane '$l=120; $sq=$_; for($i=0; $i<(length($_)-$l+1); $i++){$s = substr($sq, $i, $l); $cnt++; printf "\@M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\nGACAGTCGAC$s\n+M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\n>1AA1111BCDBB3BBDFGB3BBB01BD3D1AAB1B1BBD2DABDA1B1//AAAB//011DDGDD2A2BDB2DDDDGHDA>//BD2D1/////11/B?>B2BB1B/>//?1BB?0/B1BBGBBBFGHH?B\n", 1000+$cnt, 1000+$cnt;}' > CP011972-tiled.fastq
revcomp CP011972.2.fasta  | grep -v '^>' | tr -d "\n" | tr "a-z" "A-Z" | perl -lane '$l=120; $sq=$_; for($i=0; $i<(length($_)-$l+1); $i++){$s = substr($sq, $i, $l); $cnt++; printf "\@M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\nGACAGTCGAC$s\n+M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\n>1AA1111BCDBB3BBDFGB3BBB01BD3D1AAB1B1BBD2DABDA1B1//AAAB//011DDGDD2A2BDB2DDDDGHDA>//BD2D1/////11/B?>B2BB1B/>//?1BB?0/B1BBGBBBFGHH?B\n", 7*(10**6)+$cnt, 7*(10**6)+$cnt;}' >> CP011972-tiled.fastq

echo "CP011972-tiled.fastq" > fastq.infiles

Run "bacteria_tradis":

bacteria_tradis -f fastq.infiles -t 'GACAGTCGAC' -r CP011972.2.fasta  --smalt_s 1 --smalt_k 10 --smalt_r 0 --smalt_y 0.8 -mm 5 -v

*Make the stats file readable:

grep ^File fastq.stats | tr "," "\n" > blah && grep -v ^File fastq.stats | tr "," "\n" > blah2 && paste blah blah2

File	                               CP011972-tiled.fastq
Total Reads			       13110900
Reads Matched			       13110900
% Matched			       100
Reads Mapped			       13110900
% Mapped			       100
Unique Insertion Sites : CP011972      6101020 (length: 6555569)
Seq Len/UIS : CP011972	 	       1.0745037715005
Total Unique Insertion Sites	       6101020
Total Seq Len/Total UIS		       1.


0745037715005

Call essential genes:

tradis_gene_insert_sites  -o tradis_gene_insert_sites CP011972.2.embl CP011972-tiled.out.CP011972.insert_site_plot

The CP011972-tiled.out.CP011972.tradis_gene_insert_sites file (attached) also revealed some interesting results.

*None of the "ins_index" values were 1.0 and "gene_length" and "ins_count" almost always differ by 1 i.e. gene_length = ins_count-1.

*335 genes had an "ins_index" less than 0.1. I think all the values should be ~1.0 in this test. There are 5883 genes in total, so roughly 6% all the PSA genes are not able to be studied with your pipeline.

It'd be great if someone could take a look at this, otherwise I'll need to find a workaround.

Best wishes,
Paul.

tradis_essentiality.R

There is a problem when i execute the tradis_essentiality.R script.
It throws me the following error on one particular dataset. I have attached the out.csv file which i tried as input.

-bash-4.1$ ./tradis_essentiality.R out.csv
Error in fitdistr(ii[I1], "gamma") :
'x' must be a non-empty numeric vector
Execution halted

How to deal with biological replicates - tradis_essentiality.R

Hi @lbarquist ,

I am wondering how to deal with biological replicates (e.g. 3) when calculating the essential genes using

tradis_essentiality.R

Do you just take the median of the insertion sizes 3 replicates, calculate the insertion index once and then plug them into the script?
Or is there a way to calculate an interval range?
Any other ideas?
Thanks!

Best,
Simon

Possible issue with calculation of read start and ends in module Cigar.pm

Hi,
We attempted to use your pipeline to process the sequencing results of libraries that were produced using the Himar1 transposon. We were getting some strange results and whilst looking closer at the problem, it became obvious that a number of insertion sites were called at places that appeared to be wrong. The forward strand did no seem to be affected but many reads on the reverse strand were associated with insertion sites that were some short distance from the actual insertion site. After a bit of digging into the code I believe that the problem is in Cigar.pm.

To explain the issue as best as I could, I created a couple of slides which I have attached here as snapshots. In the first slide, you see at the top a snapshot from Artemis showing two larger peaks where the TA insertion is expected. The reads below are on the positive strand on the right and negative strand on the left. If the reads are counted, we find that the positive-strand reads add up correctly to the red peak but the negative strand reads do not add up to the blue peak (the picture has been cropped so you can't see all the reads but I can send you the original). However, reverse reads that match 100% to the reference do add up to the blue peak so something is not right with the reads that do not fully match the reference.

I think the problem is in the calculation within Cigar.pm. I have again annotated the code using information for this specific read I have highlighted above in red:

If I followed the calculation correctly (I admit to not have run this myself - I'm simply following the code in my head), then the start of this read is set to 1531 and the end to 1640. The program correctly uses the end of the read to call the insertion site (as the read is on the reverse strand) but 1640 is calculated wrongly, in my opinion. The blue peak at 1640 corresponds to this call and I'm unable to find any other read that could be responsible for it.

Given the evidence above, I think there is a problem with calculation of read starts and ends when soft-clipping happens and the cigar string starts with action "S" rather than action "M".

I saw that someone else had raised a similar issue before (although they didn't look into why it was happening - issue #86; now closed) and they were advised that this could be a result of polymerase slippage. There may well be cases like this as well in the data but I think there is an additional issue with how the read start/ends are calculated. In fact a reply to that issue explained that the problem goes away when the smalt-y parameter is set to 1 (which would mean that soft-clipped reads are not considered). If the sites were real sites of insertion (but produced by slippage), I don't see how this fix would have worked, although I admit I perhaps do not understand well the technology here.

I have little experience with Tn-seq but it seems to me that the problem wouldn't affect generally results from saturation libraries or libraries where insertion happens at random places in the genome. The insertion sites called are generally very close to the actual real site and so as long as they fall within a gene's boundaries, they would have been added to the correct gene anyway. However, for cases like ours, the calculation is problematic.

I'm sorry for the really long message but I wanted to explain as best as I can the situation. It is entirely possible that I'm wrong but looking at the code I believe there is a problem, just not one that would affect most people's calculations or that would even be visible in a situation where many insertion sites exist across the whole genome.

Thank you very much in advance,
Irilenia