sanger-pathogens / saffrontree

SaffronTree: Reference free rapid phylogenetic tree construction from raw read data

Home Page: https://sanger-pathogens.github.io/saffrontree/

License: Other

Shell 3.84% Python 80.30% TeX 15.13% Dockerfile 0.72%
genomics next-generation-sequencing sequencing research bioinformatics bioinformatics-pipeline global-health infectious-diseases pathogen

saffrontree's Introduction

SaffronTree

Fast, reference-free pseudo-phylogenomic trees from reads or contigs.

Introduction

Quickly build a tree directly from raw reads or from assembled sequences, without the need for a reference sequence or de novo assemblies. SaffronTree takes FASTQ/FASTA files as input and uses a kmer analysis to build a phylogenetic tree in Newick format (the FAQ below describes the UPGMA method used). It works well for small sets of samples (fewer than 50), but because the algorithm has a complexity of O(N^2) it does not perform well beyond that point. This is good enough to give you rapid insights into your data in minutes, rather than hours. During outbreak investigations, researchers and epidemiologists often want to quickly rule a sample in or out of an outbreak. MLST does not provide enough granularity to achieve this, since it is based on only 7 housekeeping genes. SaffronTree utilises all of the genomic data in the sample to create a visual representation of the clustering of the data. It supports NGS data (such as Illumina), 3rd generation data (PacBio/Nanopore) and assembled sequences (FASTA).
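To make the idea of a kmer analysis concrete, here is a minimal sketch of comparing samples by the kmers they share. It is an illustration only: the Jaccard distance used here is an assumption (this README does not say which distance SaffronTree computes), the function names are hypothetical, and real tools also handle reverse complements, which this sketch ignores.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(a, b):
    """1 - shared/total kmers: 0.0 for identical sets, 1.0 for disjoint sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(samples, k):
    """Compare every sample with every other sample (N x N values)."""
    sets = {name: kmers(seq, k) for name, seq in samples.items()}
    names = sorted(sets)
    return {(x, y): jaccard_distance(sets[x], sets[y]) for x in names for y in names}


# Toy sequences, far shorter than real reads or contigs.
toy_samples = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTTTTT"}
toy_matrix = distance_matrix(toy_samples, 4)
```

In the toy data, samples "a" and "b" share most of their kmers and so end up close together, while "c" shares none with either.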

Installation

SaffronTree has the following dependencies:

Required dependencies

  • KMC (>= 2.3)
  • Spades (>= 3.10.1)
  • pyfastaq (>= 3.12.0)
  • biopython (>= 1.68)
  • dendropy (>= 4.1.0)

Required resources

RAM (memory)

The RAM (memory) requirement is low, because KMC is extremely efficient and mostly disk based.

Disk space

By default all of the intermediate files are cleaned up at the end, so the overall disk space usage is quite low. The intermediate files can be kept if you use the 'verbose' option or the 'keep_files' option.

There are a number of installation methods. Choosing the right one for the system you use will make life easier. KMC version 2.3+ is supported, with KMC 3+ providing the best performance. If you encounter an issue when installing SaffronTree please contact your local system administrator. If you encounter a bug please log it here.

  • Linux/OSX/Windows/Cloud
    • Docker
    • Conda
  • Linux
    • Debian Testing/Ubuntu 16.04 (Xenial)
  • OSX
    • OSX manual method

Linux/OSX/Windows/Cloud

Docker

Install Docker. We have a docker container which gets automatically built from the latest version of SaffronTree. To install it:

docker pull sangerpathogens/saffrontree

To use it you would use a command such as this (substituting in your directories), where your files are assumed to be stored in /home/ubuntu/data:

docker run --rm -it -v /home/ubuntu/data:/data sangerpathogens/saffrontree saffrontree output sample1.fastq.gz sample2.fastq.gz

To run some of the example data that is part of the repository, run:

docker run --rm -it -v /home/ubuntu/data:/data sangerpathogens/saffrontree saffrontree output_directory /usr/local/lib/python3.5/dist-packages/saffrontree/example_data/fastqs/start_Salmonella_enterica_subsp_enterica_serovar_Typhi_Ty2_v1_1.fastq.gz /usr/local/lib/python3.5/dist-packages/saffrontree/example_data/fastqs/start_Salmonella_enterica_subsp_enterica_serovar_Typhimurium_SL1344_v4_1.fastq.gz

You will then have a tree in:

/home/ubuntu/data/output_directory/kmer_tree.newick

Conda

Install Conda. Then install the dependencies using conda and the software using pip:

conda config --add channels bioconda
conda install git kmc
pip install git+https://github.com/sanger-pathogens/saffrontree.git

Linux

The instructions for Linux assume you have root (sudo) on your machine.

Debian Testing/Ubuntu 16.04 (Xenial)

apt-get update -qq
apt-get install -y git python3 python3-setuptools python3-biopython python3-pip kmc
pip3 install git+https://github.com/sanger-pathogens/saffrontree.git

OSX manual method

Ensure Python 3.5+ is available or install it from https://www.python.org/downloads/ then follow the instructions below:

wget https://github.com/refresh-bio/KMC/releases/download/v3.0.0/KMC3.mac.tar.gz
tar zxf KMC3.mac.tar.gz
export PATH=$PWD:$PATH
pip3 install git+https://github.com/sanger-pathogens/saffrontree.git

Running the tests

The tests can be run from the top level directory:

./run_tests.sh

Usage

usage: saffrontree [options] output_directory *.fastq.gz

SaffronTree: Fast, reference-free pseudo-phylogenomic trees from reads or contigs.

positional arguments:
  output_directory      Output directory
  input_files           FASTQ/FASTA files which may be gzipped

optional arguments:
  -h, --help            show this help message and exit
  --kmer KMER, -k KMER  Kmer to use, depends on read length [31]
  --min_kmers_threshold MIN_KMERS_THRESHOLD, -m MIN_KMERS_THRESHOLD
                        Exclude k-mers occurring less than this [5]
  --max_kmers_threshold MAX_KMERS_THRESHOLD, -x MAX_KMERS_THRESHOLD
                        Exclude k-mers occurring more than this [255]
  --threads THREADS, -t THREADS
                        Number of threads [1]
  --keep_files, -f      Keep intermediate files [False]
  --verbose, -v         Turn on more debugging output [False]
  --version             show program's version number and exit

Input parameters

The following parameters change the results:

kmer: Choosing a kmer size is not an exact science, and can greatly influence the final results. This kmer size is used by KMC for counting and filtering. It should be an odd number, and a suitable range is between 25 and 61. If you choose a kmer that is too small, you will get too many false positives. If you choose a kmer that is too big, you will use a lot more RAM and potentially produce insufficient data to construct a tree. Quite often with Illumina data the beginning and end of the reads have higher sequencing error rates. Ideally you want a kmer size which sits nicely inside the high quality portion of the reads. Quality trimming your reads can help if the quality collapses quite badly at the end of the read.
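The trade-off described above can be put into numbers with a small sketch. The helper names are hypothetical and this is not code from SaffronTree or KMC. It only counts how many kmers a read of a given length yields and how many of them a single bad base would touch.

```python
def is_suggested_kmer(k):
    """True if k is odd and inside the 25 to 61 range suggested above."""
    return k % 2 == 1 and 25 <= k <= 61


def kmers_per_read(read_length, k):
    """A read of length L contains L - k + 1 kmers (none if k > L)."""
    return max(read_length - k + 1, 0)


def kmers_hit_by_error(read_length, k, position):
    """Number of kmers that overlap a single erroneous base (0-based position)."""
    if k > read_length:
        return 0
    first_start = max(0, position - k + 1)
    last_start = min(position, read_length - k)
    return max(last_start - first_start + 1, 0)


# Example with an assumed 100 bp read; the read length is illustrative only.
print(kmers_per_read(100, 31))           # kmers available at k=31
print(kmers_per_read(100, 61))           # fewer kmers at k=61
print(kmers_hit_by_error(100, 31, 50))   # an error mid-read affects many kmers
print(kmers_hit_by_error(100, 31, 99))   # an error at the very end affects one
```

A larger kmer leaves fewer kmers per read, and each error in the middle of a read spoils up to k of them, which is why trimming low quality ends can help.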

min_kmers_threshold: This value lets you set a minimum threshold for the occurrence of a kmer with raw reads. You need about 6x depth to detect a variant with reasonable confidence. Setting this too low will allow random noise (from sequencing errors) through and give you lots of false positives. The maximum suggested value is half the estimated depth of coverage for paired-end data (since forward and reverse reads are evaluated independently). If an input file is in FASTA format, this value is set to 1 for that file, as it is assumed to contain assembled contigs rather than reads.
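As a rough worked example of the "half the estimated depth" guidance, the sketch below estimates depth of coverage as total sequenced bases divided by genome size. The helper names, read counts and genome size are all assumptions chosen for illustration, not values from SaffronTree.

```python
def estimated_depth(total_bases, genome_size):
    """Approximate depth of coverage: sequenced bases per genome position."""
    return total_bases / genome_size


def max_suggested_min_kmers_threshold(depth):
    """Upper bound suggested above for paired-end data: half the depth."""
    return int(depth // 2)


# Hypothetical run: 1,000,000 read pairs of 100 bp each, 5 Mbp genome.
total_bases = 2 * 1_000_000 * 100
depth = estimated_depth(total_bases, 5_000_000)
print(depth)                                     # estimated depth of coverage
print(max_suggested_min_kmers_threshold(depth))  # largest sensible -m value
```

Any value between the default and this upper bound is a judgement call that depends on how noisy the reads are.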

max_kmers_threshold: This value lets you set a maximum threshold for the occurrence of a kmer. With KMC, there is a catchall bin for occurrences of 255 and greater (so 255 is the maximum value). By default it is set to 254 which excludes this catchall bin for kmers, and thus the long tail of very common kmers. This reduces the false positives. You need to be careful about setting this too low, since you could be excluding interesting kmers.
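The effect of the two thresholds and the catchall bin can be sketched in a few lines. This is a toy in-memory counter written for illustration, not how KMC works internally, and the function names are hypothetical.

```python
from collections import Counter

CATCHALL = 255  # counts of 255 and greater all land in this bin


def count_kmers(reads, k):
    """Count kmers across reads, saturating each count at the catchall value."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer: min(n, CATCHALL) for kmer, n in counts.items()}


def filter_kmers(counts, min_count, max_count):
    """Keep kmers whose count is within [min_count, max_count]."""
    return {kmer for kmer, n in counts.items() if min_count <= n <= max_count}


# Toy counts: one rare kmer, two ordinary ones, one in the catchall bin.
toy_counts = {"rare": 3, "ok_low": 5, "ok_high": 254, "very_common": 255}
print(sorted(filter_kmers(toy_counts, 5, 254)))
```

With a maximum of 254 the catchall bin is dropped, while a maximum of 255 would keep every very common kmer regardless of its true count.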

The following parameters have no impact on the results:

threads: This sets the number of threads available to KMC. It should never be more than the number of CPUs available on the server. If you use a compute cluster, make sure to request the same number of threads on a single server. It defaults to 1 and you will get a reasonable speed increase by adding a few CPUs, but the benefit tails off quite rapidly since the I/O becomes the limiting factor (speed of reading files from a disk or network).
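A wrapper script can guard against asking for more threads than the machine has. The sketch below is a hypothetical helper, not part of SaffronTree. Note that os.cpu_count() reports the CPUs of the whole machine, so on a compute cluster the number granted by the scheduler may be lower.

```python
import os


def safe_thread_count(requested):
    """Clamp a requested thread count to between 1 and the CPUs on this machine."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(requested, available))


print(safe_thread_count(4))
```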

verbose: By default the output is silent and all intermediate files are deleted as it goes along. Setting this flag outputs more details of the software as it runs and keeps the intermediate files.

Output

A single phylogenetic tree in Newick format is created in the output directory. This is compatible with BioPython and can be viewed with FigTree. Unfortunately there is no published standard for the Newick format, so there can be some incompatibility issues between newer file formats and older software.
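Leaf labels in the tree can be the full input file paths in single quotes, as in the tree quoted in the first issue report further down this page. If a viewer struggles with such labels, a small post-processing step like the following sketch may help. It is a hypothetical helper using only the standard library, and it assumes labels contain no embedded quote characters.

```python
import posixpath
import re

QUOTED_LABEL = re.compile(r"'([^']*)'")
SUFFIXES = (".gz", ".fastq", ".fq", ".fasta", ".fa")


def shorten_labels(newick):
    """Replace each quoted path label with its file name minus known suffixes."""
    def _short(match):
        name = posixpath.basename(match.group(1))
        for suffix in SUFFIXES:  # ".gz" is stripped first, then the format suffix
            if name.endswith(suffix):
                name = name[: -len(suffix)]
        return "'" + name + "'"
    return QUOTED_LABEL.sub(_short, newick)


print(shorten_labels("('data/fastas/s1.fa':0.5,'data/fastqs/s2.fastq.gz':0.5):0.0;"))
```

Branch lengths and tree structure are left untouched; only the text between quotes changes.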

Example usage

This repository includes some sample data, consisting of FASTQ files and FASTA files derived from Salmonella reference genomes. Only the first 10,000 bases of each reference were taken; however, since they all have the same start sites (dnaA) they contain some overlapping material. These sequences were then used to generate simulated reads in FASTQ format. The data itself covers a variety of serovars of Salmonella (a highly clonal, medically important pathogen). The S. Typhimurium samples would be expected to cluster near each other. Similarly the S. Typhi and S. Paratyphi A would be expected to cluster together. S. Weltevreden is an outgroup and should not be close to any of the other serovars. All of these serovars, except S. Weltevreden, cause very severe disease in humans.

To build a tree with the FASTA files only:

saffrontree output_directory saffrontree/example_data/fastas/*.fa

This is the resulting tree.

To build a tree with the FASTQ files only:

saffrontree output_directory saffrontree/example_data/fastqs/*.fastq.gz

This is the resulting tree.

Finally you can mix FASTA and FASTQ files (which may be gzipped):

saffrontree output_directory saffrontree/example_data/fastas/*.fa saffrontree/example_data/fastqs/*.fastq.gz

License

SaffronTree is free software, licensed under GPLv3.

Feedback/Issues

This software is now solely community supported. Please report any issues to the issues page.

Citation

"SaffronTree: Fast, reference-free pseudo-phylogenomic trees from reads or contigs", Andrew J. Page, Martin Hunt, Torsten Seemann and Jacqueline A. Keane. The Journal of Open Source Software, 2(13), 2017. http://joss.theoj.org/papers/10.21105/joss.00243

FAQ

How to contribute to the software

If you wish to contribute to this software please fork the project on GitHub and submit a pull request. We will endeavour to review it within a few days. Please include automated tests and example data (if relevant) in your pull request and ensure all the existing tests pass. Comments and documentation should be in British English.

I found a bug with some data but it's private, can I send it to you for debugging?

Please do not send us any private data. We will not sign an NDA.

Aren't distance matrix based trees inherently phenetic?

It depends on who you talk to. Yes, they have less power than more modern methods which reconstruct the ancestry; however, if you have a small amount of raw data you can get an answer faster.

Why UPGMA trees?

It's fast and it's already implemented in Python, so installation is trivial, and it doesn't give you negative branch lengths like neighbour joining.

Can I use long read data?

If your PacBio/Nanopore (long read) data is in FASTQ format, then the answer is yes; however, we have only tested it on corrected reads. Uncorrected reads are unlikely to work because you're nearly guaranteed that a sequencing error will occur within the length of a k-mer.
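The reasoning about uncorrected reads can be checked with a one-line formula: if each base is wrong independently with probability e, a k-mer is error free with probability (1 - e)^k. The independence assumption and the error rates below are illustrative only, and the function name is hypothetical.

```python
def error_free_kmer_probability(per_base_error_rate, k):
    """Chance that all k bases of a k-mer are correct, assuming independent errors."""
    return (1.0 - per_base_error_rate) ** k


# Assumed, illustrative error rates: 10% (uncorrected long reads) and 0.1%.
for rate in (0.10, 0.001):
    print(rate, round(error_free_kmer_probability(rate, 31), 3))
```

At a 10% error rate only about 4% of 31-mers would be error free, compared with about 97% at a 0.1% error rate, so most k-mers from uncorrected reads would not match between samples.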

Will you make it work with Python 2?

No. Python 3 is well supported, stable and mature, so please just install this instead.

Will there be a Windows version?

The only way to run it on Windows is via Docker. We have no plans for a native version. Honestly though, if you're using Windows to perform bioinformatics, you're in trouble.

How do I view Newick trees?

The newick format is widely supported and I find FigTree to be excellent.

Do you plan to support other formats like Nexus?

No, we have no plans to support other tree types, since Newick does the job.

It's really slow on massive datasets

Yes, the complexity is O(N^2), which means it scales poorly. But for a few dozen samples it works much quicker than other methods, so it fills a niche.
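The quadratic growth comes from comparing every sample with every other sample. A short sketch (hypothetical helper name) shows how quickly the number of pairs rises:

```python
def pairwise_comparisons(n_samples):
    """Number of distinct sample pairs: N * (N - 1) / 2."""
    return n_samples * (n_samples - 1) // 2


for n in (10, 50, 500):
    print(n, pairwise_comparisons(n))
```

Going from 50 to 500 samples multiplies the number of samples by 10 but the number of pairs by roughly 100.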

What method is used for tree construction?

We use UPGMA, which is like Neighbour Joining.
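For readers unfamiliar with UPGMA, the following is a minimal pure-Python sketch of the textbook algorithm: repeatedly merge the two closest clusters, place the new node at half their distance, and average distances weighted by cluster size. It is written for illustration and is not the implementation SaffronTree relies on; the function name and input layout are assumptions.

```python
def upgma(names, dist):
    """Build a Newick string from leaf names and pairwise distances.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    # cluster id -> (newick text, number of leaves, node height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d[(i, j)] = dist[frozenset((names[i], names[j]))]
    next_id = len(names)
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        height = d_ab / 2.0
        text = "(%s:%g,%s:%g)" % (text_a, height - height_a, text_b, height - height_b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = (text, size_a + size_b, height)
        next_id += 1
    (text, _, _), = clusters.values()
    return text + ";"


toy_dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
}
print(upgma(["A", "B", "C"], toy_dist))
```

Because each merge happens at a height no lower than the previous ones, every branch length (parent height minus child height) is zero or positive.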

Can I send you my data for analysis?

No, please install the software yourself and perform your own analysis.

The branch lengths are crazy?

It's a quick and dirty analysis from raw reads, so accurately estimating branch lengths can be difficult. What's important is which samples are near each other on the tree.

Do I need to provide both forward and reverse reads?

No, all FASTQ files are treated independently, so they will end up in the same place in the tree (if everything goes to plan).

Can I mix FASTA and FASTQ files?

Yes.

Can the input files be GZipped?

Yes, it automatically uncompresses them on the fly. You can mix and match compressed and uncompressed files.
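If you need the same mix-and-match behaviour in your own scripts, one common approach is to look at the first two bytes of a file, since gzip files start with the bytes 0x1f 0x8b. The sketch below is a hypothetical helper using the standard library and is not taken from SaffronTree.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzipped(path):
    """Open a text file for reading, transparently handling gzip compression."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    return gzip.open(path, "rt") if is_gzip else open(path, "r")
```

Checking the content rather than the file extension means a compressed file without a .gz suffix is still read correctly.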

saffrontree's People

Contributors

andrewjpage, garethpeat, seretol, ssjunnebo, tseemann

saffrontree's Issues

wrong tree from example data

Hello
I use Ubuntu 18.04 and I installed using the Debian Testing/Ubuntu 16.04 (Xenial) instructions:

apt-get update -qq
apt-get install -y git python3 python3-setuptools python3-biopython python3-pip kmc
pip3 install git+git://github.com/sanger-pathogens/saffrontree.git

I found the package in my path ~/.local/lib/python3.6/site-packages/saffrontree
but there is no run_test.py.
When I run the example saffrontree output_directory saffrontree/example_data/fastas/*.fa my tree is wrong:
[&R] (('saffrontree/example_data/fastas/start_Salmonella_enterica_subsp_enterica_serovar_Typhi_Ty2_v1.fa':0.5,('saffrontree/example_data/fastas/start_Salmonella_enterica_subsp_enterica_serovar_Typhi_str_CT18_v1.fa':0.5,'saffrontree/example_data/fastas/start_Salmonella_enterica_subsp_enterica_serovar_Typhimurium_str_LT2_v1.fa':0.5):0.0):0.0,(('saffrontree/example_data/fastas/start_Salmonella_enterica_subsp_enterica_serovar_Weltevreden_str_10259_v0.2.fa':0.5,'saffrontree/example_data/fastas/start_Salmonella_enterica_subsp_enterica_serovar_Typhimurium_SL1344_v4.fa':0.5):0.0,('saffrontree/example_data/fastas/start_Salmonella_enterica_subsp_enterica_serovar_Paratyphi_A_str_AKU_12601_v2.fa':0.5,'saffrontree/example_data/fastas/start_Salmonella_enterica_subsp_enterica_serovar_Typhimurium_DT104_v1.fa':0.5):0.0):0.0);

Do you know what happens?
Best,

Temp files not cleaned up?

saffrontree -v -t 72 out1 *.gz

ls out1/
kmer_tree.newick 
 tmp4037372r  tmpafetgs9d  tmpeohwu5zr  tmpiktm7wbp  tmppa9dp89c  tmpt_y411ch  tmpyyyfzoaw
tmp02movqb_       tmp5oxycwr7  tmpag5o2eia  tmpfw22dg7c  tmpjzo7wosr  tmpptyagnj5  tmptcj9ezgv
tmp088xwgb3       tmp5quf9a99  tmpayt9w_t1  tmpgq4mqgol  tmpjzyyqup2  tmpqr8mgj48  tmptcjbxdw6
tmp0rtnxzvw       tmp6whv4zzf  tmpbxqrdi4x  tmpgt2dgdng  tmpk4fixrsj  tmpr6dt0s4r  tmpuddotnoe
tmp12rxjbjk       tmp8rq5trlz  tmpca6rh5yf  tmph1lvw4j5  tmpkgx_e7fv  tmprmvtd4af  tmpvflj5_qc
tmp14mwdjfs       tmp92g6__au  tmpccjyh3x3  tmphh7bwucq  tmpkwwksw2n  tmprtm8t0y5  tmpxuv7o3u3
tmp2viwsdve       tmp_jw5pw4h  tmpd8moshp7  tmphn_u97y8  tmpm27z1z7z  tmpslfuddqb  tmpy4qtrq01
tmp3pgnnosn       tmpa7u01vwt  tmpebdaapp6  tmphvdzidfa  tmpmphe0s6e  tmpsyhi1uqk  tmpyjxhs1ov

No such file or directory: 'kmc'

saffrontree saffrontree_out Results/tree_estimation.fasta 
Traceback (most recent call last):
  File "/home/lenore/.local/bin/saffrontree", line 4, in <module>
    __import__('pkg_resources').run_script('saffrontree==0.1.2', 'saffrontree')
  File "/home/lenore/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 662, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/lenore/.local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1466, in run_script
    exec(script_code, namespace, namespace)
  File "/home/lenore/.local/lib/python3.10/site-packages/saffrontree-0.1.2-py3.10.egg/EGG-INFO/scripts/saffrontree", line 35, in <module>
  File "/home/lenore/.local/lib/python3.10/site-packages/saffrontree-0.1.2-py3.10.egg/saffrontree/SaffronTree.py", line 37, in __init__
  File "/home/lenore/.local/lib/python3.10/site-packages/saffrontree-0.1.2-py3.10.egg/saffrontree/KmcVersionDetect.py", line 11, in __init__
  File "/home/lenore/.local/lib/python3.10/site-packages/saffrontree-0.1.2-py3.10.egg/saffrontree/KmcVersionDetect.py", line 21, in find_version
  File "/home/lenore/Python-3.10.3/Lib/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/lenore/Python-3.10.3/Lib/subprocess.py", line 501, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/lenore/Python-3.10.3/Lib/subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/lenore/Python-3.10.3/Lib/subprocess.py", line 1842, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'kmc'
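The traceback above shows Python failing to launch an executable named kmc, which means KMC is not installed or is not on the PATH. A quick way to check this before running is sketched below. The helper is hypothetical and is not part of SaffronTree.

```python
import shutil


def find_executable(name="kmc"):
    """Return the full path of an executable on PATH, or raise a clear error."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            "'%s' was not found on PATH; install KMC or add its directory to PATH" % name
        )
    return path
```

On the command line, `which kmc` gives the same information.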

pip install

Hello,
I couldn't install saffrontree by pip install git+git://github.com/sanger-pathogens/saffrontree.git
because of this error:
ERROR: Command errored out with exit status 128: git clone -q git://github.com/sanger-pathogens/saffrontree.git /tmp/pip-req-build-7mm6vj45 Check the logs for full command output.

Instead, I installed it like this:
pip install git+https://github.com/sanger-pathogens/saffrontree.git

Maybe this will be useful to someone!
L

Running on fastqs produces matrix full of "1"

Running on a bunch of fastq files, either .gz or ungzipped, produces a matrix of identical distances full of 1s.
Running on fasta seems to work. Looks like it fails reading kmers from fastqs.
