bactpipe's People

Contributors

abhi18av, b16joski, boulund, emilio-r, emilyncosta, huyue87, thorellk

bactpipe's Issues

Break out tool parameter settings to a single shared config file

I suggest we consider putting all user-configurable tool parameters (e.g. bbduk trimming settings, assembly kmer sizes, maybe some prokka parameters, etc.) in a single configuration file. Keeping these settings in one file makes them easier to maintain, and easier for end users to find and modify as desired, without having to go into the actual code for the main Nextflow workflow.

There are a few models for how to do this, but I suggest @b16joski takes a look at a few other Nextflow-based pipelines to see how they manage hierarchical configuration files and profiles (e.g. https://github.com/ewels/NGI-RNAseq could be a starting point, but I don't think their solution is exactly what we are after).

Restructure automatic BBDuk adapters functionality

It felt wrong to always have to specify the path to adapters.fa for BBDuk, so I did some research. It turns out BBDuk can use the adapters.fa file that comes bundled with BBMap without being given the full path. It's dead easy: adapters is a keyword that BBDuk knows means the adapters file bundled with the installation. We can just give BBDuk ref=adapters instead of ref=/path/to/adapters.fa and it will automatically use the bundled file. This removes the need to automatically download it when the user gives no path to a custom adapters.fa, and cleans up a lot of the messy code we recently introduced into bactpipe.nf to handle all the edge cases. I really think we should consider fixing this before releasing BACTpipe publicly.
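The fallback described above can be sketched in a few lines. This is only an illustration in Python, not the code in bactpipe.nf: the function name bbduk_command and its arguments are hypothetical, and the trimming settings (ktrim, k, mink, hdist) are placeholder values rather than BACTpipe's actual parameters.

```python
from typing import List, Optional


def bbduk_command(
    reads1: str,
    reads2: str,
    out1: str,
    out2: str,
    adapters: Optional[str] = None,
) -> List[str]:
    """Build a bbduk.sh argument list for paired-end adapter trimming.

    Hypothetical helper for illustration; not part of BACTpipe.
    """
    # "adapters" is the keyword BBDuk resolves to its bundled adapters file.
    # A user-supplied FASTA path, if given, takes precedence.
    ref = adapters if adapters else "adapters"
    return [
        "bbduk.sh",
        f"in={reads1}",
        f"in2={reads2}",
        f"out={out1}",
        f"out2={out2}",
        f"ref={ref}",
        # Placeholder trimming settings (assumed, not BACTpipe's values).
        "ktrim=r",
        "k=23",
        "mink=11",
        "hdist=1",
    ]
```

Calling bbduk_command("s_R1.fastq.gz", "s_R2.fastq.gz", "t_R1.fastq.gz", "t_R2.fastq.gz") yields a command containing ref=adapters, while passing adapters="/path/to/adapters.fa" substitutes the custom file. No download step is needed in either case.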

Investigate Shovill to replace SPAdes

  • Research Shovill details, how it works, what it does better than SPAdes etc.
  • Make list of necessary changes to pipeline if Shovill is implemented

Branch off version 1 into separate branch for archival

I suggest we create a branch to hold the last version of BACTpipe version 1 in the repo. I don't expect any of us are interested in maintaining it now that we've improved so much in version 2, so let's just keep it as an archive in a side branch.
We could easily do this before merging the PR from Joseph's version 2 branch.

Re-add `params.project` to configuration files

I must have accidentally removed params.project from the configuration files when I restructured everything a while back. Currently, you need to specify --project <SLURMPROJECT> when running BACTpipe on Rackham or Milou.

I suggest we add params.project = "" to the default config file (params.config), and add some logic to the main workflow to show a warning and exit gracefully if it has not been set properly.

Replace mash screen with BBTools' sendsketch.sh?

I missed that Brian Bushnell last year added a tool to the BBTools suite that does something extremely similar to what mash screen does: sendsketch.sh.

It sends a sketch of an input file (or pair of input files, compressed or not, as usual for BBTools) to JGI's sketch server to compare against reference sketches of nt, refseq (default), silva, or img. It is very fast. Here's an example using the first 1000 or so reads from a Helicobacter pylori sample:

(base) [fredrik.boulund@ctmr-nas bactpipe_test]$ sendsketch.sh in=sample_R1.fastq.gz                                                          
Adding /home/ctmr/anaconda3/opt/bbmap-37.90/resources/blacklist_refseq_species_300.sketch to blacklist.                                       
Loaded 1 sketch in      0.716 seconds.                                                                                                        
                                                                                                                                              
Query: HWI-M03284:45:000000000-AWVDC:1:1101:13827:1954 1:N:0:GAATTCGTGTACTGAC   DB: RefSeq      SketchLen: 8587 Seqs: 26873     Bases: 8088773   gSize: 2134365  Quality: 0.9074 File: sample_R1.fastq.gz                                                                                      
WKID    KID     ANI     Complt  Contam  Matches Unique  noHit   TaxID   gSize   gSeqs   taxName                                               
36.19%  30.08%  96.37%  54.43%  34.52%  2752    0       2227    102617  2560243 63      Helicobacter pylori SS1                               
7.66%   5.46%   91.08%  47.88%  65.86%  469     3       2463    382638  1530316 2       Helicobacter acinonychis str. Sheeba                  
0.95%   0.77%   84.42%  36.38%  70.55%  66      0       2463    1163745 1730296 2       Helicobacter cetorum MIT 99-5656                      
0.79%   0.55%   83.87%  42.33%  70.77%  47      0       2463    104628  1475704 42      Helicobacter suis                                     
3.03%   0.03%   88.06%  100.00% 71.28%  3       0       2463    1204178 26419   1       Helicobacter phage KHP40                              
0.31%   0.07%   81.06%  6.77%   70.61%  10      0       937     543736  9071270 292     Rhodococcus opacus PD630                              
0.12%   0.12%   78.34%  30.08%  71.20%  10      0       2463    1578720 2046283 204     Helicobacter ailurogastricus                          
0.12%   0.09%   78.37%  37.89%  71.22%  8       0       2463    936155  1631038 1       Helicobacter felis ATCC 49179                         
0.11%   0.09%   78.10%  22.32%  70.79%  8       0       2087    35817   2772762 392     Helicobacter heilmannii                               
0.09%   0.07%   77.36%  35.40%  71.25%  6       0       2463    1002805 1729718 147     Helicobacter bizzozeronii CCUG 35545                  
0.07%   0.06%   76.61%  32.44%  71.26%  5       0       2463    537972  1901982 44      Helicobacter pullorum MIT 98-5489                     
0.06%   0.05%   76.58%  40.04%  71.27%  4       0       2463    679897  1539411 1       Helicobacter mustelae 12198                           
0.06%   0.05%   76.43%  37.98%  71.27%  4       0       2463    537970  1610516 24      Helicobacter canadensis MIT 98-5491                   
0.06%   0.05%   76.33%  36.61%  71.27%  4       0       2463    556267  1653215 21      Helicobacter winghamensis ATCC BAA-430                
0.06%   0.05%   76.16%  34.44%  71.27%  4       0       2463    1449345 1781940 29      Helicobacter rodentium ATCC 700285                    
0.05%   0.05%   76.06%  33.23%  71.27%  4       0       2463    1476199 1851996 61      Helicobacter sp. 12S02634-8                           
0.05%   0.05%   75.71%  29.32%  71.27%  4       0       2463    1325130 2093688 49      Helicobacter fennelliae MRY12-0050                    
0.05%   0.05%   75.72%  27.56%  71.39%  4       0       2398    211     2212464 224     Helicobacter sp. CLO-3                                
0.05%   0.03%   75.99%  43.25%  71.28%  3       0       2463    1408442 1419428 11      Helicobacter pametensis ATCC 51478                    
0.04%   0.03%   75.39%  34.70%  71.28%  3       0       2463    1905759 1770278 45      Helicobacter sp. 13S00477-4                           
                                                                                                                                              
                                                                                                                                              
Total Time:     2.198 seconds.                                                                                                                

As you can see, it works very well! Of course, it would require some tweaking of assess_mash_screen.py, to parse sendsketch.sh output instead, and some additional testing like the testing and validation we performed for our mash screen evaluations, but it shouldn't be too much work to be honest.
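To give an idea of the parsing involved, here is a minimal sketch in Python. It assumes the hit table always starts with a header line beginning with WKID and uses the twelve columns shown in the output above. The function names parse_sendsketch and top_hit are hypothetical, and ranking hits by WKID is an assumption, not something assess_mash_screen.py does today.

```python
from typing import Dict, List, Optional

# Column names as printed in the sendsketch.sh header line shown above.
COLUMNS = ["WKID", "KID", "ANI", "Complt", "Contam", "Matches",
           "Unique", "noHit", "TaxID", "gSize", "gSeqs", "taxName"]
PERCENT_COLUMNS = ("WKID", "KID", "ANI", "Complt", "Contam")
INTEGER_COLUMNS = ("Matches", "Unique", "noHit", "TaxID", "gSize", "gSeqs")


def parse_sendsketch(text: str) -> List[Dict[str, object]]:
    """Parse the hit table from sendsketch.sh text output into dicts."""
    hits: List[Dict[str, object]] = []
    in_table = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("WKID"):
            in_table = True  # header found; hit rows follow
            continue
        if not in_table:
            continue
        if not line:
            in_table = False  # a blank line ends the table
            continue
        # Split on any whitespace, keeping the taxon name (last column) whole.
        fields = line.split(None, len(COLUMNS) - 1)
        if len(fields) != len(COLUMNS):
            continue  # not a hit row
        row: Dict[str, object] = dict(zip(COLUMNS, fields))
        for key in PERCENT_COLUMNS:
            row[key] = float(str(row[key]).rstrip("%"))
        for key in INTEGER_COLUMNS:
            row[key] = int(str(row[key]))
        hits.append(row)
    return hits


def top_hit(text: str) -> Optional[Dict[str, object]]:
    """Return the hit with the highest WKID, or None if there are no hits."""
    hits = parse_sendsketch(text)
    return max(hits, key=lambda hit: hit["WKID"]) if hits else None
```

Applied to the output above, top_hit would return the Helicobacter pylori SS1 row. Whether WKID, KID, or ANI is the right criterion for flagging contamination is exactly the kind of thing the additional testing and validation would need to settle.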

We could consider eventually replacing mash screen with sendsketch.sh, thus removing the entire mash dependency, and removing the need to download a 700MB+ file with sketches of RefSeq genomes.

edit: Here's Brian Bushnell's "announcement" of sendsketch.sh on BioStars: https://www.biostars.org/p/234837/

Implement automated (regression) testing

As a stretch-goal for the next main version of the workflow, I'd like us to implement automated regression testing. It's not very difficult, and will improve our ability to catch small bugs, typos, etc. in pull requests and commits as early as possible.

To implement this in a useful way, we need at least one test case for each major path through the workflow, e.g. one gram positive, one gram negative, one with reference proteins, one which is contaminated, etc., so we can validate that all common paths through the workflow work as they should.
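One possible shape for such a check, sketched in Python under the assumption that each test case has a stored set of expected output files with known SHA-256 digests. The function names and file layout here are hypothetical. Byte-for-byte comparison only suits outputs that are deterministic between runs, so assemblies or annotations may need a looser comparison (e.g. contig counts or summary statistics).

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_regressions(output_dir: Path, expected: Dict[str, str]) -> List[str]:
    """Compare workflow outputs against expected digests.

    `expected` maps paths relative to `output_dir` to SHA-256 digests
    recorded from a known-good run. Returns one message per problem;
    an empty list means the test case passed.
    """
    problems: List[str] = []
    for rel_path, digest in sorted(expected.items()):
        candidate = output_dir / rel_path
        if not candidate.is_file():
            problems.append(f"{rel_path}: missing")
        elif sha256_of(candidate) != digest:
            problems.append(f"{rel_path}: content changed")
    return problems
```

A test runner would execute the workflow once per test case (gram positive, gram negative, contaminated, and so on) and fail the build if find_regressions returns anything for any of them.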

I'm labeling this issue as low priority and assigning it to the BACTpipe v3.0 milestone for now. There's no rush to implement this at this time.

Transfer bactpipe repository to CTMR github organization?

I propose we transfer ownership of the bactpipe_nextflow repository to the CTMR github organization when it reaches usable maturity. I also propose we change the repository name to just bactpipe when transferred to the CTMR organization.

It won't change anything regarding access or ease of use for us as developers, but it keeps the repository nicely under the CTMR organization umbrella. After such a transfer, it is still possible for all involved authors to fork the new main repository into their personal github accounts. This also improves the development workflow, as it enables individual researchers to develop code improvements in their own forks, which can then be transferred to the main repo via pull requests. This has the added benefit of making it easier for all involved parties to review all incoming changes.

Think about it, so we can discuss at a later time.

Issues with automatic download of reference databases

There are currently a few kinks to work out regarding automatic downloading of reference databases (BBDuk adapters, mash screen db).

  • If automatically downloaded, the code as it is right now won't work.
  • If automatically downloaded, the downloaded file is put into a channel that can only be read once... ;(

I'll continue work on this asap.

Move `stats` into `shovill` step

There's little reason to run BBMap's stats.sh (or statswrapper.sh, I'm not sure which we run) as its own process, as it leads to unnecessary file copying/staging for a very small process. I think it's better to move it into the assembly step, and run it after shovill completes inside the same process.

Re-organize repo main folder for nextflow best practices

I propose changes to the main folder of the repo to better follow Nextflow best practices for repo organization. We will create a separate configuration folder, and introduce some new files that make it easier to run the workflow directly from the terminal, using Nextflow's built-in support for running workflows straight out of Github repos, without downloading it beforehand.

Only output one draft genome fasta

I suggest we only output the draft genome in the form of the prokka .fna file, and not also the contigs.fa file from Shovill, since these will be basically identical.

Remove FastML sources and binaries from repository

I just noticed that there are some large commits that I think might have been accidental. Amongst other things, the commit in question includes the source code for another program (FastML), as well as a compiled version. This is a severe problem. We can't have the source code and binaries (or source code tarballs) of other software committed into the repo. Firstly, because in almost all cases it violates the copyright and distribution licenses of these tools; secondly, because it clutters and unnecessarily bloats the repository; and thirdly, because it would give us a lot more work keeping these third-party tools up to date in our repository (if we're even allowed to include them in the first place).

Instead, I think we should put links to the websites of all important dependencies in the README so that pipeline users easily can find and download the prerequisite software. That way, we don't have to take responsibility to keep the latest version of these tools inside our own repository either. Just make sure to include the version of each tool that has been tested with the pipeline.

@b16joski, this is an excellent opportunity for you to learn how to remove accidentally committed files from a git repository. It is not enough to just remove them in a new commit, since they will still remain in the repository commit history. They need to be entirely removed from history as well. I've only done this kind of operation once, and that was in Mercurial, not Git.

As far as I know, there are only two reasonable options: git filter-branch or https://rtyley.github.io/bfg-repo-cleaner/ . Research what is the best option in this case and remove all the third-party files from the repository history.

Make sure you document your steps to solving this issue in this discussion thread so we can all refer to it in the future. Let us know if you encounter issues and need help. This is a tricky operation that could possibly screw up the repository (changing history is very precarious).

Improve code style consistency

I will soon submit a pull request for changes to unify the coding style throughout the (nextflow) code base. I won't touch any of the third-party scripts.

Containerization using Docker

Background

Containerization is a technique to make it easier for end users to run stuff without having to install a lot of complicated dependencies. Common tools for this task are Docker and Singularity. Using containers is also very convenient when running on distributed cluster resources, as the container is a complete package containing all of the dependencies (libraries, scripts, tools, etc) required to run the different pipeline steps.

For our case with BACTpipe, containerization would mean that users only need to have Docker and Nextflow installed in their environment in order to run all of BACTpipe, without having to mess with installing any of the tools used inside the BACTpipe workflow (in theory, users would even be able to run BACTpipe on a Windows machine without too much trouble). Nextflow comes with Docker support out of the box, and it is very easy to use.

Containerization of BACTpipe

A simple approach to containerization of BACTpipe would be to just create a single container (e.g. a Docker image) that contains a miniature Linux environment with all of the tools used by BACTpipe. As Nextflow has built-in support, it is really quite easy to make a BACTpipe container and make Nextflow use that when executing the different processes in the BACTpipe workflow.

I already made an image for BACTpipe version 2.2-dev that works just fine. I recently pushed a new branch, docker_test, that makes some tiny changes to the Nextflow configuration to make it run the workflow processes inside Docker containers based on the Docker image I made. Nextflow still runs on your machine, but all the processes inside the workflow run inside instances of the Docker container.

Improvements

Despite it working very well for now, there are still some improvements to be made:

  • Reducing Docker image size. Unfortunately, all of the tools used in BACTpipe have lots of dependencies, which creates a rather big Docker image. By carefully removing unnecessary components in the container, or constructing a leaner image, the size can be reduced.
  • Developing a Singularity container as well, to cater for different users' needs (Singularity is quite popular at C3SE in Gothenburg for example).
  • Documentation for how to use BACTpipe with and without containers will be needed, to make it easy for users to understand what's going on.

Create a tagged release for (the old) version 1

Github has features to create "releases" using tagged commits. Now that we have an old version of the pipeline (version 1) available in a separate branch, it is easy to create a tagged release for it. Would you read up on how to do that @b16joski? Ask me if you get stuck.
