
holoflow's Issues

Calculate the % of reads mapped to contigs

It's important to know what proportion of your metagenome you're able to assign to the contigs/bins you've assembled.

One way to do this would be to add a rule to the individual/coassembly binning snakefiles:
samtools flagstat -@ {params.threads} {input.BAMs}

The output of this could be parsed with grep and sed to give the % of reads mapped, and how many reads are multimapped (mapped to multiple places).

[FEATURE] Replace BWA mem with Bowtie2 in preprocessing.py

Upon testing the current preprocessing.py pipeline, I wanted to double-check that the host read removal step was working properly. The initial pipeline used BWA-MEM.

My curiosity was piqued when I found a substantial number of reads mapping to the human genome from a wombat faecal sample. I would not expect many, as any contaminating human DNA, if present, should be at a low concentration compared to the high biomass of the faecal sample (the lab work was also done very cleanly):

  • reads passed filtering/QC: 859,323
  • microbial reads after mapping to host: 310,617
  • reads in host BAM file: 548,342!!!

I subsampled the faecal sample to 1 million paired reads (2x150 bp) and ran the preprocessing pipeline with the pipeline's default 'loose' setting; the host reference genomes used were human + wombat. I inspected the assembly at the read-alignment level in Geneious and found that the majority of alignments looked like this:
[Screenshot: Geneious read-alignment view showing BWA-MEM alignments with ~120 bp soft-clipped]

As you can see, BWA-MEM is clipping ~120 bp from the alignment and only considering the 19 bp that match exactly with the reference. This makes sense: by default the seed length (-k) is 19 bp, the match score (-A) is 1, and the clipping penalty (-L) is 5.

In this situation, the alignment score is 14 (19 exact matches x 1 [from -A], minus 5 [from -L]).

What's happening is that the clipping penalty is not high enough to suppress these spurious alignments -- I wouldn't trust alignments shorter than 30 bp, as the probability of finding a given 19-mer somewhere in a 6 Gb reference is quite high.
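A back-of-envelope check of both numbers above, under the assumed BWA-MEM defaults (-A 1, -L 5, -k 19) and a deliberately naive reference model (uniform random bases, single strand), which is a sketch rather than a rigorous estimate:

```python
# Alignment score for a 19 bp exact-match seed with one clipped end,
# using assumed BWA-MEM defaults: match score -A 1, clipping penalty -L 5.
match_score, clip_penalty, seed_len = 1, 5, 19
aln_score = seed_len * match_score - clip_penalty
print(aln_score)  # 14, matching the example above

# Expected number of exact hits for one specific 19-mer in a 6 Gb
# reference (naive model: uniform random bases, one strand only):
ref_len, k = 6e9, 19
expected_hits = ref_len * 0.25 ** k
print(round(expected_hits, 3))  # ~0.022, i.e. roughly a 2% chance per read
```

Roughly 2% per read may look small, but over ~1 million read pairs it amounts to tens of thousands of expected spurious seed matches -- enough to explain the inflated host BAM.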

Trying above again with Bowtie2:

  • reads passed filtering/QC: 859,323
  • microbial reads after mapping to host: 849,241
  • reads in host BAM file: 9,943

Looks much better.

Simulated data
I also tested this by simulating a metagenome (30 species) that I subsampled to 1 million paired reads (2x150). I spiked in human reads at ~35%:

Using default Bowtie2 settings (--sensitive & --end-to-end alignment):

  • pass filtering/qc: 975,477
  • microbial reads after mapping to host: 610,394
  • human reads that were not mapped: 380 (or 0.1%) [I can determine this because the read headers have the source genome from where they were derived]
  • No microbial reads were mapped to the human/wombat reference.

Next step is to implement Bowtie2 into the preprocessing.py pipeline.
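A sketch of what that Bowtie2 call could look like as a command builder (function name and file paths are hypothetical; the flags are standard Bowtie2 options, matching the settings tested above):

```python
def bowtie2_host_removal_cmd(index, r1, r2, out_prefix, threads=8):
    """Build a Bowtie2 host-removal command (sketch; paths are hypothetical).

    --sensitive --end-to-end matches the settings tested above;
    --un-conc-gz writes the read pairs that do NOT align concordantly
    to the host, i.e. the microbial fraction we want to keep.
    """
    return [
        "bowtie2",
        "-p", str(threads),
        "--sensitive", "--end-to-end",
        "-x", index,                          # host index (e.g. human + wombat)
        "-1", r1, "-2", r2,
        "--un-conc-gz", f"{out_prefix}_nonhost_%.fq.gz",
        "-S", f"{out_prefix}_host.sam",       # host-mapped reads, kept for QC counts
    ]
```

The list form can be passed straight to `subprocess.run(cmd, check=True)` from the pipeline scripts.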

Make a script to set up file structure/scripts, etc.

The pipeline is currently very clunky to get started with; one needs to create folders, set up input/config files, and write and submit multiple scripts to run the different pipelines.

I want to make a simple script that automates this process. I've made a start -- see holo-setup.sh.

[ENHANCEMENT] Automate/optimise preprocessing report creation

What is wanted:
A tsv or html report summarising key metrics for the preprocessing pipeline, e.g.

  • col1 (sample name)
  • col2 (raw reads)
  • col3 (reads passed fastp)
  • col4..col(3+n) (reads mapped to host genome 1..n)
  • col(4+n) (reads passed host mapping)
  • col(5+n) (% reads passed host mapping)
  • col(6+n) (Gbp passed host mapping)

How:
Can implement CoverM to measure the number of reads mapping to each host genome, then subtract those per-genome counts from col3 to derive the reads passed host mapping directly (i.e. the decompression/counting step in holo-map-split-ref.py can be removed).
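The column arithmetic can be sketched as a small row-builder (the function name is hypothetical, and the per-genome counts would come from parsed CoverM output; the numbers in the test below are illustrative):

```python
def report_row(sample, raw, passed_fastp, per_genome_mapped):
    """Assemble one preprocessing-report row (column layout as in the issue).

    `per_genome_mapped` maps host genome name -> mapped read count,
    e.g. parsed from CoverM output. Reads passed host mapping is derived
    by subtraction, avoiding the decompress-and-count step.
    """
    passed_host = passed_fastp - sum(per_genome_mapped.values())
    return {
        "sample": sample,
        "reads_raw": raw,
        "reads_passed_fastp": passed_fastp,
        **{f"mapped_{g}": n for g, n in per_genome_mapped.items()},
        "reads_passed_host": passed_host,
        "pct_passed_host": round(100.0 * passed_host / passed_fastp, 2),
    }
```

Note this simple subtraction assumes each read is counted against at most one host genome; multi-mapped reads would need primary-alignment-only counting to avoid double subtraction.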

[FEATURE] Implement fastp in preprocessing:

Rationale:

  • Faster than AdapterRemoval2
  • Includes optional deduplication and low-complexity read removal steps
  • Can send output to stdout -> pipe into BWA to save I/O and space

Also, remove the current .stats file generation for this step, as it currently has to decompress each sample to count the number of reads pre/post quality filtering. fastp creates reports containing this information that we can parse instead.
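A sketch of what the fastp-to-BWA pipe could look like, built as a shell command string (function name and file paths are hypothetical; `--stdout` makes fastp emit interleaved reads, and `bwa mem -p` reads interleaved pairs from stdin):

```python
import shlex

def fastp_bwa_pipe(r1, r2, host_index, report_json, threads=8):
    """Build a fastp | bwa mem shell pipeline (sketch; paths hypothetical).

    fastp writes interleaved reads to stdout, bwa mem -p consumes them
    from stdin ("-"), so no intermediate FASTQ touches disk. The JSON
    report carries the pre/post filtering read counts for the report.
    """
    fastp = [
        "fastp", "-i", r1, "-I", r2,
        "--dedup",                   # optional deduplication
        "--low_complexity_filter",   # optional low-complexity read removal
        "--json", report_json,       # read counts pre/post filtering live here
        "--stdout",
    ]
    bwa = ["bwa", "mem", "-p", "-t", str(threads), host_index, "-"]
    return " ".join(map(shlex.quote, fastp)) + " | " + " ".join(map(shlex.quote, bwa))
```

If BWA is later swapped for Bowtie2 (per the other issue), the same idea applies with Bowtie2's `--interleaved -` input mode.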
