anttonalberdi / holoflow
Bioinformatics pipeline for hologenomics data generation and analysis
License: GNU General Public License v3.0
It's important to know what proportion of your metagenome you're able to assign to the contigs/bins that you've assembled.
One way to do this would be to add a rule to the individual/coassembly binning snakefiles:
samtools flagstat -@ {params.threads} {input.BAMs}
The output of this could be parsed with grep and sed to give the % of reads mapped, and how many reads are multimapped (mapped to multiple places).
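As an alternative to grep/sed, the flagstat output could be parsed in Python -- a minimal sketch on hypothetical counts (the field layout follows samtools flagstat's plain-text format; the numbers below are made up for illustration):

```python
import re

def parse_flagstat(text):
    """Extract total, mapped, and secondary (multi-mapping) alignment
    counts from `samtools flagstat` output."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"(\d+) \+ (\d+) (.+)", line)
        if not m:
            continue
        passed, field = int(m.group(1)), m.group(3)
        if field.startswith("in total"):
            stats["total"] = passed
        elif field.startswith("secondary"):
            stats["secondary"] = passed
        elif field.startswith("mapped ("):
            stats["mapped"] = passed
    stats["pct_mapped"] = 100.0 * stats["mapped"] / stats["total"]
    return stats

# Hypothetical flagstat output for demonstration
example = """\
20000 + 0 in total (QC-passed reads + QC-failed reads)
120 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
19000 + 0 mapped (95.00% : N/A)
"""
print(parse_flagstat(example))
```

Secondary alignments are a reasonable proxy for multimapped reads here, though the exact accounting depends on the mapper's settings.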
I've noticed that quite a few outputs are uncompressed by default (assemblies in COA/IA, .fa's in DR/FS).
Upon testing the current preprocessing.py pipeline, I wanted to double-check that the host read removal step was working properly. The initial pipeline used BWA-MEM.
My curiosity was piqued when I found a substantial number of reads mapping to the human genome from a wombat faecal sample. I would not expect many: any contaminating human DNA, if present, should be at low concentration compared to the high biomass of the faecal sample (the lab work was also done very cleanly):
I subsampled the faecal sample to 1 million paired reads (2x150 bp) and ran the preprocessing pipeline with its default 'loose' setting; the host reference genomes were human + wombat. I inspected the alignments at the read level in Geneious, and found that the majority looked like this:
As you can see, BWA-MEM is clipping ~120 bp from the alignment and only scoring the 19 bp that match the reference exactly. This makes sense: by default the seed length (-k) is 19 bp, the match score (-A) is 1, and the clipping penalty (-L) is 5.
In this situation the alignment score is 14 (19 exact matches x 1 [from -A], minus 5 [from -L]).
What's happening is that the clipping penalty is not high enough to remove these spurious alignments -- I wouldn't trust alignments shorter than 30 bp, as the probability of finding any given 19-mer somewhere in a 6 Gb reference is quite high.
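To put a rough number on that -- a back-of-the-envelope sketch assuming uniform, random base composition (real genomes are more repetitive, so this is if anything an underestimate):

```python
# Expected number of exact 19 bp seed matches between one read and a
# large reference, under a uniform-random base model (illustrative only).
ref_len = 6e9              # combined host reference size in bp (assumed)
k = 19                     # BWA-MEM default seed length (-k)
read_len = 150

p_kmer = ref_len / 4**k    # expected occurrences of one specific k-mer
n_seeds = read_len - k + 1 # k-mer start positions within one read
expected_hits = n_seeds * p_kmer

# Close to 3 expected spurious 19-mer hits per 150 bp read
print(round(expected_hits, 2))
```

So even a read with no true host origin is likely to carry a seed that matches somewhere, which is why a 19 bp clipped alignment by itself is not evidence of host contamination.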
Trying the above again with Bowtie2:
Looks much better.
Simulated data
I also tested this by simulating a metagenome (30 species), which I subsampled to 1 million paired reads (2x150 bp). I spiked in human reads at ~35%:
Using the default Bowtie2 settings (--sensitive and --end-to-end alignment):
The next step is to implement Bowtie2 in the preprocessing.py pipeline.
The pipeline is currently very clunky to get started with: one needs to create folders, set up input/config files, and run and submit multiple scripts for the different pipelines.
I want to make a simple script that automates this process. I've made a start -- see holo-setup.sh.
What is wanted:
A tsv or html report summarising key metrics for the preprocessing pipeline, e.g.
How:
We could implement CoverM to measure the number of reads mapping to each host genome, then use col3..coln to optimise the calculation of col5 (i.e. we can get rid of the decompression/counting step in holo-map-split-ref.py).
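Parsing a CoverM per-genome read-count table would be straightforward -- a sketch, assuming CoverM is run in read-count mode and emits a genome-by-sample TSV (the column names and numbers below are illustrative, not guaranteed to match CoverM's exact output):

```python
import csv
import io

# Hypothetical CoverM output: one row per host genome, read counts per
# sample. Column headers and counts here are made-up placeholders.
coverm_tsv = """\
Genome\tsample1 Read Count
human\t350000
wombat\t610000
"""

counts = {}
for row in csv.DictReader(io.StringIO(coverm_tsv), delimiter="\t"):
    counts[row["Genome"]] = int(row["sample1 Read Count"])

# Fraction of host-mapped reads attributable to each genome
total = sum(counts.values())
fractions = {genome: n / total for genome, n in counts.items()}
print(fractions)
```

This would replace the decompress-and-count step with a single pass over a small TSV.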
The holo-in-reformat step takes quite a long time, especially on large FASTQs. In the next few days I'll test whether bbrename.sh (from the BBMap suite) is faster/more efficient (it's potentially multithreaded too).
Rationale:
Also, remove the current .stats file generation for this step: at the moment it has to decompress each sample to count the number of reads pre/post quality filtering. Fastp creates reports with this information, which we could parse instead.
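Pulling those counts from the fastp JSON report could look like this -- a minimal sketch, with a trimmed-down inline report standing in for a real fastp run (the summary/before_filtering/after_filtering keys follow fastp's standard report layout; the counts are made up):

```python
import json

# Trimmed-down stand-in for a real fastp JSON report
report = json.loads("""
{
  "summary": {
    "before_filtering": {"total_reads": 2000000},
    "after_filtering":  {"total_reads": 1850000}
  }
}
""")

# Read counts pre/post quality filtering, without touching the FASTQs
before = report["summary"]["before_filtering"]["total_reads"]
after = report["summary"]["after_filtering"]["total_reads"]
print(before, after, f"{100 * after / before:.2f}% retained")
```

In the real pipeline the `json.loads` call would be a `json.load` on the per-sample report file that fastp writes with `--json`.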
I couldn't get DAS Tool to run within the coassembly binning pipeline: it seems to be some special kind of Ruby dependency hell.
Therefore, I've replaced it with the metaWRAP bin refiner, which works as well, if not better: https://github.com/bxlab/metaWRAP
The next step is to implement it in the individual binning pipeline.