ctmrbio / bactpipe

BACTpipe: An assembly and annotation pipeline for bacterial genomics

Home Page: https://bactpipe.readthedocs.org

License: MIT License

Python 19.55% Groovy 46.05% Nextflow 30.53% Shell 3.06% Makefile 0.82%

nextflow pipeline whole-genome-sequencing bacteria bioinformatics assembly annotation

bactpipe's Issues

Break out tool parameter settings to a single shared config file

I suggest we consider putting all user-configurable tool parameters (e.g. bbduk trimming settings, assembly kmer sizes, maybe some prokka parameters, etc.) in a single configuration file. Having all of these settings in one file makes them easier to maintain, and also lets end users access and modify parameters as desired without having to go into the code of the main Nextflow workflow.

There are a few models for how to do this, but I suggest @b16joski takes a look at a few other Nextflow-based pipelines to see how they manage hierarchical configuration files and profiles (e.g. https://github.com/ewels/NGI-RNAseq could be a starting point, but I don't think their solution is exactly what we are after).

Concatenate mash screen output files and remove redundant files from output folders

We can have one output file summarizing mash screen results for the users. Also we can remove any intermediate files from the various BACTpipe output directories.
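A minimal sketch of how the per-sample results could be merged, assuming each sample produces a tab-separated file named <sample_id>.mash_screen.txt with the standard mash screen columns (identity, shared hashes, median multiplicity, p-value, query ID, query comment). The function name, the summary column names, and the output path are hypothetical.

```python
import csv
import glob
import os

# Column names for the summary file; the names are illustrative, the order
# follows the standard mash screen output.
COLUMNS = ["sample", "identity", "shared_hashes", "median_multiplicity",
           "p_value", "query_id", "query_comment"]


def summarize_mash_screens(input_dir, summary_path, suffix=".mash_screen.txt"):
    """Merge all per-sample mash screen tables into one TSV.

    Each output row is prefixed with the sample name taken from the file name.
    Returns the number of data rows written.
    """
    rows_written = 0
    with open(summary_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(COLUMNS)
        for path in sorted(glob.glob(os.path.join(input_dir, "*" + suffix))):
            sample = os.path.basename(path)[: -len(suffix)]
            with open(path) as handle:
                for line in handle:
                    fields = line.rstrip("\n").split("\t")
                    if len(fields) < 5:
                        continue  # skip blank or truncated lines
                    writer.writerow([sample] + fields)
                    rows_written += 1
    return rows_written
```

Writing the summary under a name that does not end in the per-sample suffix keeps it from being picked up as input on a rerun.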

Implement Mash for initial screening of data

https://genomeinformatics.github.io/mash-screen/

http://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes

When the BactPipe 1.0 issues are fixed, make basic readme instructions

And don't forget to have a mini-celebration when this is done :)

Don't include input reads in mash screen output folder

There's an easy way to ensure Nextflow doesn't include the input reads in the mash screen output folder.

Change this line: https://github.com/ctmrbio/BACTpipe/blob/master/bactpipe.nf#L21

publishDir "${params.output_dir}/mash.screen", mode: 'copy'

to:

publishDir "${params.output_dir}/mash.screen", mode: 'copy', pattern: '*.{txt,tsv}'

Progressive Mauve in the bin directory is currently a symlink.

This needs to be fixed in some way for the future.

Rewrite assembly_stats.py

Rewrite in Python 3.5+ and preferably remove unnecessary Biopython dependency.
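As a starting point, here is a sketch of basic assembly statistics using only the standard library and syntax that works on Python 3.5. Which statistics the current assembly_stats.py reports is not stated here, so the set below (contig count, total length, longest contig, N50) and all function names are assumptions.

```python
def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the running total of contig lengths,
    sorted longest first, reaches half of the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_stats(path):
    """Collect a few summary statistics for one assembly FASTA file."""
    lengths = read_fasta_lengths(path)
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

Parsing FASTA by hand like this only needs the header marker and line lengths, which is why the Biopython dependency can be dropped for this purpose.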

Restructure automatic BBDuk adapters functionality

It felt wrong to always have to specify the path to adapters.fa for BBDuk, so I did some research. It turns out that BBDuk can use the adapters.fa file that comes bundled with BBMap without the full path being specified. It's dead easy: adapters is a keyword that BBDuk interprets as the adapters file bundled with the installation. We can just give BBDuk ref=adapters instead of ref=/path/to/adapters.fa and it will automatically use the bundled adapters file.

This removes the need to automatically download the file when the user gives no path for a custom adapters.fa, and cleans up a lot of the messy code we recently introduced into bactpipe.nf to handle all the edge cases. I really think we should consider fixing this before releasing BACTpipe publicly.

Investigate Shovill to replace SPAdes

Research Shovill details, how it works, what it does better than SPAdes etc.
Make list of necessary changes to pipeline if Shovill is implemented

Branch off version 1 into separate branch for archival

I suggest we create a branch to hold the last version of BACTpipe version1 in the repo. I don't expect any of us are interested in maintaining it now that we've improved so much in version 2, so let's just keep it as an archive in a side branch.
We could easily do this before merging the PR from Joseph's version 2 branch.

Make assess_mash_screen.py work if there are no good hits

This is the output file for the Brachyspira aalborgi PC4226IV sample: PC4226IV.mash_screen.txt
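One way to handle this case is to make the assessment return an explicit "no good hits" result instead of failing. The sketch below assumes the standard tab-separated mash screen output with identity in the first column; the function names and the 0.85 identity threshold are hypothetical and are not taken from the current assess_mash_screen.py.

```python
def parse_mash_screen(lines):
    """Parse mash screen lines into (identity, query_id, query_comment) tuples."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # blank or truncated line
        try:
            identity = float(fields[0])
        except ValueError:
            continue  # not a data line
        comment = fields[5] if len(fields) > 5 else ""
        hits.append((identity, fields[4], comment))
    return hits


def assess_hits(hits, min_identity=0.85):
    """Return the best hit, or a 'no_good_hits' status when nothing passes.

    min_identity is an illustrative cutoff, not a validated value.
    """
    good = sorted((hit for hit in hits if hit[0] >= min_identity), reverse=True)
    if not good:
        return {"status": "no_good_hits", "best": None}
    return {"status": "ok", "best": good[0]}
```

The caller can then report the status and continue with the remaining samples, rather than stopping on an empty or low-identity result.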

Re-add `params.project` to configuration files

I must have accidentally removed params.project from the configuration files when I restructured everything a while back. Currently, you need to specify --project <SLURMPROJECT> when running BACTpipe on Rackham or Milou.

I suggest we add params.project = "" to the default config file (params.config), and add some logic to the main workflow to show a warning and exit gracefully if it has not been set properly.

Test issue

Missing bracket in config file

When running BACTpipe v2.0 with Yue just now, we noticed an unfortunate syntax error in nextflow.config.

https://github.com/ctmrbio/BACTpipe/blob/master/nextflow.config#L17

There is no closing bracket for the rackham profile definition.

Determine the best way to identify species from assembly

We need to determine the best way to identify a reference species based on the assembled sequences (contigs.fa). Current ideas have touched upon MASH, but also Jspecies, etc. Still no solution.

Annotation using custom curated references by prokka

Let the user optionally specify reference sequences for prokka to use during annotation.

Remove the automatic download of mashscreen db

Replace with

Parameter 'mashscreen_database' is empty. Download from https://address-to-database.

Change so that each sample gets its own prokka output folder

Instead of the pipeline writing all prokka output files into the same folder, each sample should get a separate folder, i.e. prokka/${sample_id}

Make the pipeline start on gzipped files rather than .fastq

Replace mash screen with BBTools' sendsketch.sh?

I missed that Brian Bushnell last year added a tool to the BBTools suite that does something extremely similar to what mash screen does: sendsketch.sh.

It sends a sketch of an input file (or pair of input files, compressed or not, in the usual BBTools manner) to JGI's sketch server to compare against reference sketches of nt, refseq (default), silva, or img. It is very fast. Here's an example using the first 1000 or so reads from a Helicobacter pylori sample:

(base) [fredrik.boulund@ctmr-nas bactpipe_test]$ sendsketch.sh in=sample_R1.fastq.gz                                                          
Adding /home/ctmr/anaconda3/opt/bbmap-37.90/resources/blacklist_refseq_species_300.sketch to blacklist.                                       
Loaded 1 sketch in      0.716 seconds.                                                                                                        
                                                                                                                                              
Query: HWI-M03284:45:000000000-AWVDC:1:1101:13827:1954 1:N:0:GAATTCGTGTACTGAC   DB: RefSeq      SketchLen: 8587 Seqs: 26873     Bases: 8088773   gSize: 2134365  Quality: 0.9074 File: sample_R1.fastq.gz                                                                                      
WKID    KID     ANI     Complt  Contam  Matches Unique  noHit   TaxID   gSize   gSeqs   taxName                                               
36.19%  30.08%  96.37%  54.43%  34.52%  2752    0       2227    102617  2560243 63      Helicobacter pylori SS1                               
7.66%   5.46%   91.08%  47.88%  65.86%  469     3       2463    382638  1530316 2       Helicobacter acinonychis str. Sheeba                  
0.95%   0.77%   84.42%  36.38%  70.55%  66      0       2463    1163745 1730296 2       Helicobacter cetorum MIT 99-5656                      
0.79%   0.55%   83.87%  42.33%  70.77%  47      0       2463    104628  1475704 42      Helicobacter suis                                     
3.03%   0.03%   88.06%  100.00% 71.28%  3       0       2463    1204178 26419   1       Helicobacter phage KHP40                              
0.31%   0.07%   81.06%  6.77%   70.61%  10      0       937     543736  9071270 292     Rhodococcus opacus PD630                              
0.12%   0.12%   78.34%  30.08%  71.20%  10      0       2463    1578720 2046283 204     Helicobacter ailurogastricus                          
0.12%   0.09%   78.37%  37.89%  71.22%  8       0       2463    936155  1631038 1       Helicobacter felis ATCC 49179                         
0.11%   0.09%   78.10%  22.32%  70.79%  8       0       2087    35817   2772762 392     Helicobacter heilmannii                               
0.09%   0.07%   77.36%  35.40%  71.25%  6       0       2463    1002805 1729718 147     Helicobacter bizzozeronii CCUG 35545                  
0.07%   0.06%   76.61%  32.44%  71.26%  5       0       2463    537972  1901982 44      Helicobacter pullorum MIT 98-5489                     
0.06%   0.05%   76.58%  40.04%  71.27%  4       0       2463    679897  1539411 1       Helicobacter mustelae 12198                           
0.06%   0.05%   76.43%  37.98%  71.27%  4       0       2463    537970  1610516 24      Helicobacter canadensis MIT 98-5491                   
0.06%   0.05%   76.33%  36.61%  71.27%  4       0       2463    556267  1653215 21      Helicobacter winghamensis ATCC BAA-430                
0.06%   0.05%   76.16%  34.44%  71.27%  4       0       2463    1449345 1781940 29      Helicobacter rodentium ATCC 700285                    
0.05%   0.05%   76.06%  33.23%  71.27%  4       0       2463    1476199 1851996 61      Helicobacter sp. 12S02634-8                           
0.05%   0.05%   75.71%  29.32%  71.27%  4       0       2463    1325130 2093688 49      Helicobacter fennelliae MRY12-0050                    
0.05%   0.05%   75.72%  27.56%  71.39%  4       0       2398    211     2212464 224     Helicobacter sp. CLO-3                                
0.05%   0.03%   75.99%  43.25%  71.28%  3       0       2463    1408442 1419428 11      Helicobacter pametensis ATCC 51478                    
0.04%   0.03%   75.39%  34.70%  71.28%  3       0       2463    1905759 1770278 45      Helicobacter sp. 13S00477-4                           
                                                                                                                                              
                                                                                                                                              
Total Time:     2.198 seconds.

As you can see, it works very well! Of course, it would require some tweaking of assess_mash_screen.py, to parse sendsketch.sh output instead, and some additional testing like the testing and validation we performed for our mash screen evaluations, but it shouldn't be too much work to be honest.

We could consider eventually replacing mash screen with sendsketch.sh, thus removing the entire mash dependency, and removing the need to download a 700MB+ file with sketches of RefSeq genomes.

edit: Here's Brian Bushnell's "announcement" of sendsketch.sh on BioStars: https://www.biostars.org/p/234837/
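To give an idea of the parsing work involved, here is a sketch that reads the results table from sendsketch.sh output like the example above. It assumes the table starts at the line beginning with WKID, that columns are whitespace-separated, and that taxName is the last column and may contain spaces; the function names are hypothetical.

```python
def parse_sendsketch(text):
    """Return one dict per hit row, keyed by the column names in the header line."""
    hits = []
    header = None
    for line in text.splitlines():
        if line.startswith("WKID"):
            header = line.split()
            continue
        if header is None:
            continue  # still in the preamble before the table
        if not line.strip():
            if hits:
                break  # a blank line after the rows ends the table
            continue
        # Split on whitespace, but keep the last column (taxName) in one piece.
        fields = line.split(None, len(header) - 1)
        if len(fields) != len(header):
            continue
        hits.append(dict(zip(header, (field.strip() for field in fields))))
    return hits


def best_hit_by_ani(hits):
    """Return the hit with the highest ANI, or None if there are no hits."""
    if not hits:
        return None
    return max(hits, key=lambda hit: float(hit["ANI"].rstrip("%")))
```

In the example output the rows are ordered with the strongest match first, so the first parsed hit and the highest-ANI hit coincide there; whether that ordering is guaranteed would need checking against the BBTools documentation.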

Rewrite FilterAssembly.pl to python

...and make sure it is python 3-compatible

Create documentation prototype using Sphinx

At minimum: create docs folder, and Sphinx skeleton.

Implement automated (regression) testing

As a stretch-goal for the next main version of the workflow, I'd like us to implement automated regression testing. It's not very difficult, and will improve our ability to catch small bugs, typos, etc. in pull requests and commits as early as possible.

In order to implement this in a useful way, we need at least one test case for each major path through the workflow, e.g. one gram positive, one gram negative, one with reference proteins, one which is contaminated, etc., so we can validate that all common paths through the workflow work as they should.

I'm labeling this issue as low priority and assigning it to the BACTpipe v3.0 milestone for now. There's no rush to implement this at this time.

Create flowchart showing branching workflow for version 2.0

For version 2.0, we need a flowchart showing the branching workflow.

Transfer bactpipe repository to CTMR github organization?

I propose we transfer ownership of the bactpipe_nextflow repository to the CTMR github organization when it reaches usable maturity. I also propose we change the repository name to just bactpipe when transferred to the CTMR organization.

It won't change anything regarding access or ease of use for us as developers, but it keeps the repository nicely under the CTMR organization umbrella. After such a transfer, it is still possible for all involved authors to fork the new main repository into their personal github accounts. This also improves the development workflow, as it enables individual researchers to develop code improvements in their own forks, which can then be transferred to the main repo via pull requests. This has the added benefit of making it easier for all involved parties to review all incoming changes.

Think about it, so we can discuss at a later time.

Remove the Tutorial section of the docs for now

To be reintroduced if someone has a reason and time to write a tutorial.

Make installation instructions including dependencies for local usage

Clarify dependencies in the documentation

We need to clarify in the documentation that users of the pipeline must have progressiveMauve and tbl2asn in their PATH or in the bactpipe bin directory.

Make assess_mash_screen.py find PhiX

...and if so, make bbduk also use the phiX genome file
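A rough sketch of the idea, assuming PhiX shows up in the query ID or query comment columns of the mash screen output and that BBDuk is given a comma-separated ref list. The function names and the default phix.fa path are hypothetical placeholders.

```python
def phix_detected(mash_screen_lines):
    """Return True if any mash screen hit mentions PhiX in its ID or comment."""
    for line in mash_screen_lines:
        fields = line.rstrip("\n").split("\t")
        # Columns 5 and 6 of mash screen output are query ID and query comment.
        if any("phix" in field.lower() for field in fields[4:6]):
            return True
    return False


def bbduk_ref_argument(phix_found, phix_fasta="phix.fa"):
    """Build the ref= argument for BBDuk, adding the PhiX genome when needed.

    phix_fasta is a placeholder path to a PhiX genome file.
    """
    refs = ["adapters"]
    if phix_found:
        refs.append(phix_fasta)
    return "ref=" + ",".join(refs)
```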

Issues with automatic download of reference databases

There are currently a few kinks to work out regarding automatic downloading of reference databases (BBDuk adapters, mash screen db).

If automatically downloaded, the code as it is right now won't work.
If automatically downloaded, the downloaded file is put into a channel that can only be read once... ;(

I'll continue work on this asap.

Move `stats` into `shovill` step

There's little reason to run BBMap's stats.sh (or statswrapper.sh, I'm not sure which we run) as its own process, as it leads to unnecessary file copying/staging for a very small process. I think it's better to move it into the assembly step, and run it inside the same process after shovill completes.

Re-organize repo main folder for nextflow best practices

I propose changes to the main folder of the repo to better follow Nextflow best practices of repo organization. We will create a separate configuration folder, and introduce some new files to make it easier to run the workflow directly from the terminal without downloading it beforehand using Nextflow's built-in functionality for running workflows directly out of Github repos.

Make bbduk output gzipped files

... and the other programs to use the gzipped files as input files

Only output one draft genome fasta

I suggest we only output the draft genome in the form of the prokka .fna file, and not also the contig.fa file from Shovill, since these will be basically identical.

Remove FastML sources and binaries from repository

I just noticed that there are some large commits that I think might have been accidental. Amongst other things, the commit in question includes the source code for another program (FastML), as well as a compiled version. This is a severe problem. We can't have the source code and binaries (or source code tarballs) of other software committed into the repo. Firstly, because it in almost all cases violates the copyright and distribution licenses of these tools, secondly because it clutters and unnecessarily bloats the repository, and thirdly because it would give us a lot more work keeping these third-party tools up-to-date in our repository (if we're even allowed to include them in the first place).

Instead, I think we should put links to the websites of all important dependencies in the README so that pipeline users easily can find and download the prerequisite software. That way, we don't have to take responsibility to keep the latest version of these tools inside our own repository either. Just make sure to include the version of each tool that has been tested with the pipeline.

@b16joski, this is an excellent opportunity for you to learn how to remove accidentally committed files from a git repository. It is not enough to just remove them in a new commit, since they will still remain in the repository commit history. They need to be entirely removed from history as well. I've only done this kind of operation once, and that was in Mercurial, not Git.

As far as I know, there are only two reasonable options: git filter-branch or https://rtyley.github.io/bfg-repo-cleaner/ . Research what is the best option in this case and remove all the third-party files from the repository history.

Make sure you document your steps to solving this issue in this discussion thread so we can all refer to it in the future. Let us know if you encounter issues and need help. This is a tricky operation that could possibly screw up the repository (changing history is very precarious).

Create flowchart for version 1.0

Show all steps in the "one path" version of the pipeline, similar to the draft in Joseph's thesis.

Improve code style consistency

I will soon submit a pull request for changes to unify the coding style throughout the (nextflow) code base. I won't touch any of the third-party scripts.

Implement MultiQC

Add a MultiQC step at the end of the pipeline workflow

Optimize UPPMAX allocations

Reduce allocations for:

SPAdes to 3 cores
BBDuk to 1 core
prokka to 1 core
Mauve to 1 core

Include the strain ID in the fasta header during rename.py

So that the fasta headers in the clean file that are used as input to prokka are of the format

${sample_id}_contig1
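A minimal sketch of the renaming, assuming contigs are numbered from 1 in the order they appear in the file. The function name is hypothetical, and the code avoids f-strings so it also runs on Python 3.5.

```python
def rename_fasta_headers(lines, sample_id):
    """Yield FASTA lines with each header replaced by >{sample_id}_contigN."""
    contig_number = 0
    for line in lines:
        if line.startswith(">"):
            contig_number += 1
            yield ">{}_contig{}\n".format(sample_id, contig_number)
        else:
            yield line  # sequence lines pass through unchanged
```

Because it works on any iterable of lines, it can be used directly on open file handles, for example by passing the input handle as lines and writing the result with writelines on the output handle.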

Containerization using Docker

Background

Containerization is a technique to make it easier for end users to run stuff without having to install a lot of complicated dependencies. Common tools for this task are Docker and Singularity. Using containers is also very convenient when running on distributed cluster resources, as the container is a complete package containing all of the dependencies (libraries, scripts, tools, etc) required to run the different pipeline steps.

For our case with BACTpipe, containerization would mean that users only need to have Docker and Nextflow installed in their environment in order to run all of BACTpipe, without having to mess with installing any of the tools used inside the BACTpipe workflow (in theory, users would even be able to run BACTpipe on a Windows machine without too much trouble). Nextflow comes with Docker support out of the box, and it is very easy to use.

Containerization of BACTpipe

A simple approach to containerization of BACTpipe would be to just create a single container (e.g. a Docker image) that contains a miniature Linux environment with all of the tools used by BACTpipe. As Nextflow has built-in support, it is really quite easy to make a BACTpipe container and make Nextflow use that when executing the different processes in the BACTpipe workflow.

I already made an image for BACTpipe version 2.2-dev that works just fine. I recently pushed a new branch, docker_test, that makes some tiny changes to the Nextflow configuration to make it run the workflow processes inside Docker containers based on the Docker image I made. Nextflow still runs on your machine, but all the processes inside the workflow run inside instances of the Docker container.

Improvements

Despite it working very well for now, there are still some improvements to be made:

Reducing Docker image size. Unfortunately, all of the tools used in BACTpipe have lots of dependencies, which creates a rather big Docker image. By carefully removing unnecessary components from the container, or by constructing a leaner image, the size can be reduced.
Developing a Singularity container as well, to cater for different users' needs (Singularity is quite popular at C3SE in Gothenburg for example).
Documentation for how to use BACTpipe with and without containers will be needed, to make it easy for users to understand what's going on.

Make sure rename_fasta.py is python 3 compatible

Go through prokka parameters for 2.0

Make config file for Rackham

Create a tagged release for (the old) version 1

Github has features to create "releases" using tagged commits. Now that we have an old version of the pipeline (version 1) available in a separate branch, it is easy to create a tagged release for it. Would you read up on how to do that @b16joski? Ask me if you get stuck.