
holoflow's Issues

Calculate the % of reads mapped to contigs

It's important to know what proportion of your metagenome you're able to assign to the contigs/bins you've assembled.

One way to do this would be to add a rule to the individual/coassembly binning snakefiles:
samtools flagstat -@ {params.threads} {input.BAMs}

The output of this could be parsed with grep and sed to give the % of reads mapped, and how many reads are multimapped (mapped to multiple places).

[FEATURE] Replace BWA mem with Bowtie2 in preprocessing.py

Upon testing the current preprocessing.py pipeline, I wanted to double-check that the host read removal step was working properly. The initial pipeline used BWA-MEM.

My curiosity was piqued when I found a substantial number of reads mapping to the human genome from a wombat faecal sample. I would not expect many, as any contaminating human DNA, if present, should be at a low concentration compared to the high biomass of the faecal sample (the lab work was also done very cleanly):

  • reads passed filtering/QC: 859,323
  • microbial reads after mapping to host: 310,617
  • reads in host BAM file: 548,342!!!

I subsampled the faecal sample to 1 million paired reads (2x150 bp) and ran the preprocessing pipeline with the pipeline's default 'loose' setting; the host reference genomes used were human + wombat. I inspected the assembly at the read-alignment level in Geneious and found that the majority of alignments looked like this:
[Screenshot: Geneious read-alignment view showing BWA-MEM alignments with ~120 bp soft-clipped]

As you can see, BWA-MEM is clipping ~120 bp from the alignment and only considering the 19 bp that match exactly with the reference. This makes sense: by default the seed length (-k) is 19 bp, the match score (-A) is 1, and the clipping penalty (-L) is 5.

In this situation, the alignment score is 14 (19 exact matches x 1 [from -A], minus 5 [from -L]).

What's happening is that the clipping penalty is not high enough to suppress these spurious alignments -- I wouldn't trust alignments shorter than 30 bp, as the probability of finding a given 19-mer somewhere in a 6 Gb reference is quite high.
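A back-of-envelope check of both numbers above, under the assumed BWA-MEM defaults (-A 1, -L 5, -k 19) and a deliberately naive reference model (uniform random bases, single strand), which is a sketch rather than a rigorous estimate:

```python
# Alignment score for a 19 bp exact-match seed with one clipped end,
# using assumed BWA-MEM defaults: match score -A 1, clipping penalty -L 5.
match_score, clip_penalty, seed_len = 1, 5, 19
aln_score = seed_len * match_score - clip_penalty
print(aln_score)  # 14, matching the example above

# Expected number of exact hits for one specific 19-mer in a 6 Gb
# reference (naive model: uniform random bases, one strand only):
ref_len, k = 6e9, 19
expected_hits = ref_len * 0.25 ** k
print(round(expected_hits, 3))  # ~0.022, i.e. roughly a 2% chance per read
```

Roughly 2% per read may look small, but over ~1 million read pairs it amounts to tens of thousands of expected spurious seed matches -- enough to explain the inflated host BAM.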

Trying above again with Bowtie2:

  • reads passed filtering/QC: 859,323
  • microbial reads after mapping to host: 849,241
  • reads in host BAM file: 9,943

Looks much better.

Simulated data
I also tested this by simulating a metagenome (30 species) that I subsampled to 1 million paired reads (2x150). I spiked in human reads at ~35%:

Using default Bowtie2 settings (--sensitive & --end-to-end alignment):

  • pass filtering/qc: 975,477
  • microbial reads after mapping to host: 610,394
  • human reads that were not mapped: 380 (or 0.1%) [I can determine this because the read headers have the source genome from where they were derived]
  • No microbial reads were mapped to the human/wombat reference.

Next step is to implement Bowtie2 into the preprocessing.py pipeline.
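A sketch of what that Bowtie2 call could look like as a command builder (function name and file paths are hypothetical; the flags are standard Bowtie2 options, matching the settings tested above):

```python
def bowtie2_host_removal_cmd(index, r1, r2, out_prefix, threads=8):
    """Build a Bowtie2 host-removal command (sketch; paths are hypothetical).

    --sensitive --end-to-end matches the settings tested above;
    --un-conc-gz writes the read pairs that do NOT align concordantly
    to the host, i.e. the microbial fraction we want to keep.
    """
    return [
        "bowtie2",
        "-p", str(threads),
        "--sensitive", "--end-to-end",
        "-x", index,                          # host index (e.g. human + wombat)
        "-1", r1, "-2", r2,
        "--un-conc-gz", f"{out_prefix}_nonhost_%.fq.gz",
        "-S", f"{out_prefix}_host.sam",       # host-mapped reads, kept for QC counts
    ]
```

The list form can be passed straight to `subprocess.run(cmd, check=True)` from the pipeline scripts.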

Make a script to set up file structure/scripts, etc.

The pipeline is currently very clunky to get started with; one needs to create folders, set up input/config files, and write and submit multiple scripts to run the different pipelines.

I want to make a simple script that automates this process. I've made a start -- see holo-setup.sh.

[ENHANCEMENT] Automate/optimise preprocessing report creation

What is wanted:
A tsv or html report summarising key metrics for the preprocessing pipeline, e.g.

  • col1 (sample name)
  • col2 (raw reads)
  • col3 (reads passed fastp)
  • col4..col(3+n) (reads mapped to host genome 1..n)
  • col(4+n) (reads passed host mapping)
  • col(5+n) (% reads passed host mapping)
  • col(6+n) (Gbp passed host mapping)

How:
Can implement CoverM to measure the number of reads mapping to each host genome, then subtract those per-genome counts from col3 to derive the reads passed host mapping directly (i.e. the decompression/counting step in holo-map-split-ref.py can be removed).
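The column arithmetic can be sketched as a small row-builder (the function name is hypothetical, and the per-genome counts would come from parsed CoverM output; the numbers in the test below are illustrative):

```python
def report_row(sample, raw, passed_fastp, per_genome_mapped):
    """Assemble one preprocessing-report row (column layout as in the issue).

    `per_genome_mapped` maps host genome name -> mapped read count,
    e.g. parsed from CoverM output. Reads passed host mapping is derived
    by subtraction, avoiding the decompress-and-count step.
    """
    passed_host = passed_fastp - sum(per_genome_mapped.values())
    return {
        "sample": sample,
        "reads_raw": raw,
        "reads_passed_fastp": passed_fastp,
        **{f"mapped_{g}": n for g, n in per_genome_mapped.items()},
        "reads_passed_host": passed_host,
        "pct_passed_host": round(100.0 * passed_host / passed_fastp, 2),
    }
```

Note this simple subtraction assumes each read is counted against at most one host genome; multi-mapped reads would need primary-alignment-only counting to avoid double subtraction.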

[FEATURE] Implement fastp in preprocessing:

Rationale:

  • Faster than AdapterRemoval2
  • Includes optional deduplication and low-complexity read removal steps
  • Can send output to stdout -> pipe into BWA to save I/O and space

Also, remove the current .stats file generation for this step, as it currently has to decompress each sample to count the number of reads pre/post quality filtering. fastp creates reports containing this information that we can parse instead.
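A sketch of what the fastp-to-BWA pipe could look like, built as a shell command string (function name and file paths are hypothetical; `--stdout` makes fastp emit interleaved reads, and `bwa mem -p` reads interleaved pairs from stdin):

```python
import shlex

def fastp_bwa_pipe(r1, r2, host_index, report_json, threads=8):
    """Build a fastp | bwa mem shell pipeline (sketch; paths hypothetical).

    fastp writes interleaved reads to stdout, bwa mem -p consumes them
    from stdin ("-"), so no intermediate FASTQ touches disk. The JSON
    report carries the pre/post filtering read counts for the report.
    """
    fastp = [
        "fastp", "-i", r1, "-I", r2,
        "--dedup",                   # optional deduplication
        "--low_complexity_filter",   # optional low-complexity read removal
        "--json", report_json,       # read counts pre/post filtering live here
        "--stdout",
    ]
    bwa = ["bwa", "mem", "-p", "-t", str(threads), host_index, "-"]
    return " ".join(map(shlex.quote, fastp)) + " | " + " ".join(map(shlex.quote, bwa))
```

If BWA is later swapped for Bowtie2 (per the other issue), the same idea applies with Bowtie2's `--interleaved -` input mode.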
