mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.

License: GNU General Public License v3.0

Python 0.95% Shell 0.02% Jupyter Notebook 55.14% Perl 0.01% HTML 43.88% Jinja 0.01%
tuberculosis snakemake pipeline bioinformatics nanopore illumina drug-resistance transmission clustering diagnostics

head_to_head_pipeline's Introduction

Paper

Hall, M. B. et al. Evaluation of Nanopore sequencing for Mycobacterium tuberculosis drug susceptibility testing and outbreak investigation: a genomic analysis. The Lancet Microbe (2022). doi: 10.1016/S2666-5247(22)00301-9.


This repository holds the pipelines/scripts used for our paper analysing Illumina and Nanopore sequencing for M. tuberculosis drug resistance calling and transmission clustering.

If you want to analyse your Nanopore data in the same manner as in this paper, we suggest using https://github.com/mbhall88/tbpore, a Python program that runs the drug resistance prediction and clustering components of this pipeline (with a smaller decontamination database). It is actively maintained and much easier to use.

All pipelines require a common set of dependencies to be installed.

See the subdirectories for more specific information about each pipeline. They are nested according to their dependence on the outputs of other pipelines.

Some pipelines in this repository are not relevant to the work in the final paper.

Data availability

All data are submitted under project accession PRJEB49093.

The accessions and all relevant sample metadata for this study can be found at https://doi.org/10.6084/m9.figshare.19304648.

The raw Nanopore data is available to download from: https://ftp.ebi.ac.uk/pub/databases/ont_tb_eval2022/. See the sample metadata file for mappings between samples and the relevant Nanopore runs and barcode numbers.


head_to_head_pipeline's Issues

Decontamination of reads

Following the process that clockwork uses for decontamination seems the best approach.

Shall we set an arbitrary threshold, N, where if a sample has >N% NTM content, we exclude the sample from the project entirely?

At the end of this process, we have a fastq for each sample containing only "good quality" reads (i.e. not NTM/human). This fastq will be used downstream for all other aspects of the paper (except maybe DST).
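
Below is a minimal sketch of the proposed exclusion rule, assuming a hypothetical per-sample composition table (columns: sample, ntm_percent) produced by the decontamination step; the threshold value is purely illustrative.

import csv

NTM_THRESHOLD = 5.0  # the arbitrary N% discussed above; illustrative only

def samples_to_exclude(composition_tsv: str) -> list[str]:
    """Return samples whose NTM content exceeds the threshold."""
    excluded = []
    with open(composition_tsv) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if float(row["ntm_percent"]) > NTM_THRESHOLD:
                excluded.append(row["sample"])
    return excluded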

Baseline variant analysis

The truth set of variants for the Illumina data will come from the PHE variant calling pipeline, COMPASS.
As a baseline for the Nanopore data, we will run bcftools, with some filtering of variants to remove low-quality calls and a mask to avoid repetitive and structurally variable regions of the Mtb genome.

The results for this subsection will show concordance of the basic Nanopore variant calling with the Illumina truth set. We will discuss how the different filters trialled affected the results and plot recall against concordance.
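
As an illustration of the mask described above, here is a sketch (assuming pysam is available and the mask is a single-chromosome BED file of repetitive/structurally variable regions) that drops calls falling inside masked intervals:

import pysam

def load_mask(bed_path: str) -> list[tuple[int, int]]:
    """Load mask intervals (0-based, half-open) for a single chromosome."""
    intervals = []
    with open(bed_path) as fh:
        for line in fh:
            _, start, end = line.split()[:3]
            intervals.append((int(start), int(end)))
    return intervals

def write_unmasked(vcf_in: str, bed_path: str, vcf_out: str) -> None:
    """Drop records whose position falls inside a masked interval."""
    mask = load_mask(bed_path)
    with pysam.VariantFile(vcf_in) as vcf:
        with pysam.VariantFile(vcf_out, "w", header=vcf.header) as out:
            for rec in vcf:
                # rec.start is the 0-based position of the record
                if not any(s <= rec.start < e for s, e in mask):
                    out.write(rec)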

Variant caller to use for baseline analysis

We initially said we would use clockwork to do the baseline Illumina variant calling. But I raised the point that COMPASS may actually be better for this paper, as it is the current "standard"; clockwork is not yet published or in production.

Excluding samples based on coverage

Do we want to exclude samples based on coverage? If so, what threshold should we set?

The coverage for the Nanopore data can be seen on a per-sample basis for the Malagasy and South African data in the figures below.

TODO

  • add figures
  • get coverage information for Illumina data
  • provide stats on coverage at a project-wide level

Project-wide genome mask

Do we use

  • Oxford mask
  • Maha (Max Marin) mask

I prefer Max's mask for various reasons (mostly the transparency of how it was created). But, not sure if you are hesitant to move away from the Oxford one @iqbal-lab?

If we do want to use Max's one I guess we probably need to run it by them first?

Demultiplex Birmingham nanopore data

  • tb_birmingham_4
  • mgit_phe_1
  • mgit_phe_2
  • birmingham-mgit-run3
  • birmingham-mgit-run4
  • birmingham-mgit-run5
  • birmingham-mgit-run6
  • birmingham-mgit-run7

This script was used to demultiplex all data in the following way.

demux -i guppy_v3.4.5/fastqs -o guppy_v3.4.5/demux -z --trim -t 8

This assumes the default guppy version (3.4.5) and the default barcode kit (EXP-NBD103).

Call per-sample variants with pandora

Tasks that fall under this issue:

  • pandora map looking for de novo variants
  • update MSAs with discovered sequences
  • turn updated MSAs into PRGs (make_prg)
  • index new PRG
  • pandora map not looking for de novo variants

Normalise pandora map VCFs

Here we want to make the VCFs output from #46 interoperable with the COMPASS and bcftools VCFs, which are based on H37Rv. The main task is adjusting the POS column to the corresponding position on H37Rv, ensuring information about the local PRG name and coordinates is retained in the INFO field.
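
A minimal sketch of the coordinate adjustment, assuming a hypothetical offsets table mapping each local PRG name to its 1-based start position on H37Rv (header declarations for the new INFO keys are omitted for brevity):

def lift_to_h37rv(vcf_in: str, vcf_out: str, offsets: dict[str, int]) -> None:
    """Rewrite CHROM/POS onto H37Rv, keeping local coordinates in INFO."""
    with open(vcf_in) as fin, open(vcf_out, "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, info = fields[0], int(fields[1]), fields[7]
            fields[0] = "NC_000962.3"  # H37Rv; contig name is illustrative
            fields[1] = str(offsets[chrom] + pos - 1)
            # retain provenance in INFO as proposed above
            extra = f"LOCAL_PRG={chrom};LOCAL_POS={pos}"
            fields[7] = extra if info == "." else f"{info};{extra}"
            fout.write("\t".join(fields) + "\n")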

Apply variants to H37Rv reference chunks

For the dense and sparse H37Rv PRGs [#3] we need to add a certain number of variants to our PRG. After we have ranked the variants with FORGe [#2], we can subset this ranking into dense and sparse sets.
Then, for every variant in the ranking, we take the reference sequence for the "chunk" [#9] containing that position and apply the variant to it. We then add the mutated sequence to a fasta file for that chunk.
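
A sketch of applying a single variant to a chunk, assuming 1-based variant positions on H37Rv and that the chunk's start coordinate is known:

def apply_variant(chunk_start: int, chunk_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return the chunk sequence with the variant applied."""
    i = pos - chunk_start  # 0-based index of the variant within the chunk
    assert chunk_seq[i : i + len(ref)] == ref, "REF does not match the chunk"
    return chunk_seq[:i] + alt + chunk_seq[i + len(ref):]

# e.g. a C->T SNP at position 102 in a chunk starting at position 100
mutated = apply_variant(100, "GACTGA", 102, "C", "T")  # -> "GATTGA"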

Clarify usage of pandora compare

This issue is for discussing the role of pandora compare in this paper (if it does play a role?).

@iqbal-lab will we also run the samples through compare? If so, what questions, specifically, are we trying to answer with it? Is it the same as the per-sample analysis, but checking whether compare gives better results?

Basic analysis of per-sample variant calls

The aim of this subsection will be to try varying degrees of PRG complexity for Mtb sample analysis. Refer to #3 for details on construction.

For each of the PRGs, we plan to perform the following analysis: quantify the number of SNPs and indels pandora calls per sample, on average, and see how this compares to the baseline [#4] and the truth. We will report the concordance rate, i.e. the proportion of shared sites between pandora and the truth that agree.

Lineage-specific VCFs

Split the large CRyPTIC VCF, with all samples in it, into lineage-specific VCFs to make variant ranking easier [#2].

Drug susceptibility prediction

Validate concordance with mykrobe for Illumina and Nanopore. In particular, the analysis will focus on the (hopefully) lower coverage required by pandora to achieve the same, or better, results as mykrobe, and the increased detection power provided by de novo variant discovery.

Produce QC report

At the end of the QC pipeline, a report with the following information would be useful:

  • Fraction of reads mapped to contaminants (per-sample)
  • Fraction of reads that did not map to the decontamination database (per-sample)
  • krona plots showing sample composition (per-sample)
  • Coverage plots for final fastq
  • Lineage information of samples

Convert single-fast5 to multi-fast5

  • tb_birmingham_4
  • mgit_phe_1
  • mgit_phe_2
  • birmingham-mgit-run1
  • birmingham-mgit-run2
  • birmingham-mgit-run3
  • birmingham-mgit-run4
  • birmingham-mgit-run5
  • birmingham-mgit-run6
  • birmingham-mgit-run7

Construct PRGs

We plan to try out a variety of PRGs in this project

  • Linear PRG - This will just be the standard M. tuberculosis reference H37Rv
  • Sparse PRG - H37Rv and all variants at frequency above 10%, from CRyPTIC
  • Dense PRG - The same as Sparse PRG, but with variants at frequency above 1%.
  • Representative PRG - This will contain two high-quality genomes from each lineage, plus all variants from the above dataset at >=5% frequency.

Concordance of baseline variant callers

We will look at two different kinds of concordance (a sketch of both follows the list):

  1. At positions where COMPASS says ALT, how often does bcftools on Nanopore agree?
  2. At all positions where COMPASS "makes a statement", how often does bcftools on Nanopore agree?
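
A sketch of both definitions, assuming each call set is represented as a dict mapping position to a call ("REF", an ALT allele, or None where the caller makes no statement); excluding positions where bcftools makes no call is one possible choice, not a settled one:

def concordance(compass: dict[int, str], nanopore: dict[int, str]) -> tuple[float, float]:
    """Return (ALT concordance, all-statement concordance)."""
    # type 1: COMPASS ALT positions where bcftools also makes a statement
    alt_shared = [p for p, c in compass.items()
                  if c not in (None, "REF") and nanopore.get(p) is not None]
    alt_agree = sum(compass[p] == nanopore[p] for p in alt_shared)
    # type 2: all positions where COMPASS makes any statement
    stated = [p for p, c in compass.items()
              if c is not None and nanopore.get(p) is not None]
    agree = sum(compass[p] == nanopore[p] for p in stated)
    return alt_agree / len(alt_shared), agree / len(stated)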

Use FORGe to rank variants

FORGe is a tool for ranking variants and building an optimal graph genome. We are specifically interested in using it to help us select the variants we want to use for constructing our various PRGs.

SNP distance between Illumina and Nanopore calls

One of the main figures for the paper will show a scatter plot of pairwise SNP distances. The two axes will be the Illumina SNP distance and the Nanopore SNP distance for each pair of samples. If clustering with Nanopore is interchangeable with Illumina clustering, we would hope to see a perfect diagonal line from the bottom left to the top right, indicating the distance between samples is the same for both technologies' SNP calls. Failing this, we would hope for a linear relationship of some sort, which would allow us to provide a recommendation for SNP thresholds when using Nanopore.
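
A sketch of this figure, assuming the paired pairwise distances (e.g. from snp-dists) have already been flattened into two aligned lists:

import matplotlib.pyplot as plt

def plot_distances(illumina: list[int], nanopore: list[int], out: str) -> None:
    """Scatter Illumina vs. Nanopore pairwise SNP distances."""
    fig, ax = plt.subplots()
    ax.scatter(illumina, nanopore, s=8, alpha=0.5)
    lim = max(max(illumina), max(nanopore))
    ax.plot([0, lim], [0, lim], ls="--", c="grey", label="y = x")  # perfect agreement
    ax.set_xlabel("Illumina pairwise SNP distance")
    ax.set_ylabel("Nanopore pairwise SNP distance")
    ax.legend()
    fig.savefig(out)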

Basecall Birmingham nanopore data

  • tb_birmingham_4
  • mgit_phe_1
  • mgit_phe_2
  • birmingham-mgit-run3
  • birmingham-mgit-run4
  • birmingham-mgit-run5
  • birmingham-mgit-run6
  • birmingham-mgit-run7

All data are basecalled using this script with default parameters, i.e. guppy version 3.4.5 and the basecalling model dna_r9.4.1_450bps_hac_prom.cfg.

Investigate concordance outliers

[Figure: call rate vs. concordance (concordance-outliers)]

From this plot of call rate against concordance, we can see there are 5 samples that are clear outliers when it comes to concordance. These 5 samples are:

  • mada_2-46
  • R27006
  • R13303
  • mada_1-34
  • mada_1-50

I looked at a random selection of VCF positions where these samples made incorrect calls and could not see any discernible metric causing this low concordance.

For the following plots I extracted all positions (for each of the 5 samples) where bcftools made an incorrect (false) call and compared those to positions where it made a correct (true) call.

Strand bias


There does not appear to be any enrichment for strand bias in the false calls.

VCF QUAL score


Note: the false calls have a higher QUAL because most did not have an ALT, so the QUAL calculation is slightly different and yields a higher score. See here for elaboration on how the calculation differs.
The main takeaway is that if we were to raise the minimum QUAL filtering cutoff to 30, we would remove the smaller peak in the false calls. This would, of course, be at the expense of a drop in call rate. @iqbal-lab do you think this is an acceptable change to the filtering?
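
To gauge the cost of the change, a sketch (assuming plain, uncompressed VCFs) that counts how many calls a given minimum QUAL cutoff would remove:

def count_below_qual(vcf_path: str, min_qual: float = 30.0) -> tuple[int, int]:
    """Return (calls below the QUAL threshold, total calls)."""
    below = total = 0
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            total += 1
            qual = line.split("\t")[5]  # the QUAL column
            if qual != "." and float(qual) < min_qual:
                below += 1
    return below, total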

Median depth


Strangely, the false calls seem to have higher depth. I'm not sure there is much that can be done here.


Conclusion

There does not appear to be any simple fix with respect to filters that can explain the low concordance of these samples.

Next thing

Plot the concordance for each of these samples against all other samples to see if they may be sample swaps. For instance, if mada_2-46 has better concordance with a different sample, it has quite likely been swapped and we may need to exclude these samples.

Assign lineages

Assign lineages to the samples in the ~15,000-sample CRyPTIC VCF. This is important to allow us to split the VCF into lineage-specific VCFs, ensuring we don't drown out the under-represented lineages when we build the PRGs [#3].

Evaluate basecalling model

(Proposed) Methods for evaluating the Mtb-specific guppy basecalling model.

1

Create a file mapping read IDs (for the evaluation reads) to the sample they come from.

2

Split the evaluation reads for each basecalling model into per-sample fastq files. This will result in two fastq files for each sample.

Note: there is an inherent assumption here that the barcode assignment from the original guppy model is to be taken as "truth".
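
A sketch of this splitting step, assuming the read ID to sample mapping from step 1 is in memory and the evaluation reads for one model are in a single (uncompressed) fastq:

from pathlib import Path

def split_by_sample(fastq: str, read_to_sample: dict[str, str], outdir: str) -> None:
    """Write each read to <outdir>/<sample>.fastq based on its read ID."""
    Path(outdir).mkdir(exist_ok=True)
    handles = {}
    with open(fastq) as fh:
        while (header := fh.readline()):
            # a fastq record is four lines: header, sequence, '+', qualities
            record = header + fh.readline() + fh.readline() + fh.readline()
            read_id = header.split()[0].lstrip("@")
            sample = read_to_sample.get(read_id)
            if sample is None:
                continue  # read is not in the evaluation set
            if sample not in handles:
                handles[sample] = open(Path(outdir) / f"{sample}.fastq", "w")
            handles[sample].write(record)
    for h in handles.values():
        h.close()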

3

Map each fastq to the reference assembly for that sample, using minimap2 (output PAF).
Non-standard flags to use:

  • -c generate CIGAR
  • -2 use two IO threads
  • --secondary=no do not output secondary alignments

Q: Are we happy to ignore unmapped, secondary, and supplementary alignments? I think we should, but better to be explicit with this decision.

4

Calculate the 'BLAST identity' for each alignment. In PAF format, this is achieved by dividing column 10 (number of matching bases in the mapping) by column 11 (number of bases, including gaps, in the mapping). BLAST identity is more suitable than gap-compressed identity for this assessment because we want the length of indels to affect the identity metric.
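
A sketch of the calculation, assuming minimap2 was run with -c (from step 3) so the match count is CIGAR-based:

def blast_identities(paf_path: str) -> list[float]:
    """BLAST identity per alignment: PAF column 10 / column 11."""
    identities = []
    with open(paf_path) as fh:
        for line in fh:
            fields = line.split("\t")
            matches, aln_len = int(fields[9]), int(fields[10])  # columns 10 and 11
            if aln_len > 0:
                identities.append(matches / aln_len)
    return identities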

5

Generate assemblies with Rebaler. Ryan specifically created Rebaler for assessing consensus accuracy when comparing basecalling models. It is effectively reference-guided assembly with some Racon polishing thrown in, i.e. it uses the structure of the original assembly and replaces the bases with those from the reads.

6

Generate plots. The idea is to aggregate all data and plot for all samples together as we are assessing the model rather than the samples.


@iqbal-lab

Outstanding questions

  • Should we also produce a plot of sequence errors at the read level? If so, would the best way to gather this be from CIGAR strings?
  • Are we happy to ignore non-primary alignments for evaluation?
  • How best to visualise BLAST identity vs read length?
  • How best to visualise consensus accuracy?

As mentioned at the beginning, this is a proposal. Discuss anything else you think is missing or not needed.

Subsample PRG "founder" variants

[continuation of #2]
We will no longer use FORGe to select variants for our PRG construction [#3]. Instead, we will randomly subsample from the lineage-specific VCFs [#11] for lineages with >100 samples. If there are fewer than 100 samples in a VCF, we include all variants from that VCF.
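
A sketch of this rule, assuming plain-text lineage VCFs and that the number of samples in each lineage is known; k, the number of variants to keep for large lineages, is a parameter we have not yet fixed:

import random

def subsample_records(vcf_path: str, n_samples_in_lineage: int, k: int) -> list[str]:
    """Keep k random records if the lineage has >100 samples, else all."""
    with open(vcf_path) as fh:
        records = [line for line in fh if not line.startswith("#")]
    if n_samples_in_lineage <= 100:
        return records
    return random.sample(records, min(k, len(records)))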

Generate consensus sequence from variant calls

In order to create a SNP distance matrix using snp-dists we need fasta sequences for each sample's variant calls in a multi-fasta file.

bcftools consensus does not quite do what we need: it does not adhere to FILTER, and its behaviour at missing/null sites may not be what we want.

@iqbal-lab I know I sound like a broken record, but what base shall we produce in each of the following circumstances? (A sketch of one possible decision logic follows the list.)

  • position is missing in VCF (bcftools outputs the reference position)
  • position is called NULL/. (bcftools outputs the reference position unless you use the --missing parameter)
  • position fails FILTER
  • position is in mask (bcftools outputs N)
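
A sketch of one possible decision logic for a single position; the choices for each case (N for masked, failed-filter, and null positions; the reference base where the VCF is silent) are assumptions for discussion, not decisions:

def consensus_base(ref_base: str, call, filter_pass: bool, in_mask: bool) -> str:
    """Pick the consensus base for one position; call is None if absent from the VCF."""
    if in_mask:
        return "N"       # position is in the mask
    if call is None:
        return ref_base  # position missing from the VCF
    if not filter_pass or call == ".":
        return "N"       # failed FILTER, or NULL call
    return call          # the called allele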

Add Birmingham data to samplesheet

Reminder: samples with no Illumina data

17_616007 illumina data not found
17_613843 illumina data not found
17_616229 illumina data not found
17_616205 illumina data not found
17_616182 illumina data not found
17_619748 illumina data not found
17_619267 illumina data not found
17_609069_c2 illumina data not found
18_620949 illumina data not found

Filter pandora map VCFs

There are two types of filtered VCF that we want to produce here:

  1. Remove "large" things. Effectively we want to restrict to indels of a certain size.
  2. Remove non-SNPs, as this VCF will be used in the concordance/distance comparison with COMPASS and bcftools.

Question for @iqbal-lab: what size indels do we want to restrict to? (A sketch of both filters follows.)
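
A sketch of both filters, assuming plain-text VCF records; MAX_INDEL stands in for the undecided size threshold:

MAX_INDEL = 20  # illustrative only; the real cutoff is still to be decided

def keep_record(line: str, snps_only: bool) -> bool:
    """Filter 1 (size) when snps_only is False; filter 2 (SNPs) when True."""
    fields = line.split("\t")
    ref, alts = fields[3], fields[4].split(",")
    if snps_only:
        return len(ref) == 1 and all(len(a) == 1 for a in alts)
    return all(abs(len(a) - len(ref)) <= MAX_INDEL for a in alts)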

Reproducing "truth" Illumina clusters using genome graphs and nanopore

In this section, we will calculate the pairwise SNP distance between all pairs of samples, using the truth set from the baseline variant calling [#4]. We will then do the same using the pandora variant calls [#6] from whichever PRG we decide provides the best results.
The main figure for this subsection will be a dot plot where each dot is a pair and the X- and Y-axes are the pairwise distance for that pair according to the truth set and pandora, respectively. In a perfect world, we would expect to see a dead-straight diagonal line from the bottom-left to top-right. In reality, we hope to see a general linear trend. A linear trend would allow us to state that although specific SNP distance thresholds used to define clusters differ between technologies, they are relative and therefore one can use Nanopore data with pandora to generate epidemiological clusters for Mtb.

Quality control

This epic will be used for aggregating issues/tasks relating to the quality control of samples prior to everything else we do in this project.

Maximum coverage threshold for samples

We should probably set a maximum coverage threshold for samples. A sample with coverage above this threshold will be subsampled down to the threshold. This would be the last step in the QC pipeline, to ensure we are subsampling only relevant reads.

100x seems reasonable for Nanopore; what about Illumina?
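
A sketch of how the cap could work, assuming the H37Rv genome length and reads taken at random until the base budget is spent (a dedicated tool such as rasusa does this properly):

import random

MTB_GENOME_SIZE = 4_411_532  # H37Rv length in bp
MAX_COVERAGE = 100

def subsample_to_coverage(read_lengths: list[int]) -> list[int]:
    """Return indices of reads to keep so total bases <= 100x the genome."""
    budget = MTB_GENOME_SIZE * MAX_COVERAGE
    order = list(range(len(read_lengths)))
    random.shuffle(order)
    kept, total = [], 0
    for i in order:
        if total + read_lengths[i] > budget:
            continue  # skip reads that would overshoot; smaller ones may still fit
        kept.append(i)
        total += read_lengths[i]
    return kept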

Assign lineages for project samples

This information will no doubt be useful at some point.

For now, probably easiest to use the lineage assignment script from the PRG building and the VCFs from COMPASS.

Organise PHE Birmingham data

Emails from Yifei about this data:

12/11/2018

I have 3 pieces of data for 52 MGITs (50 TB, 1 abscessus, and 1 avium) from Birmingham:

  • Nanopore fastq;
  • Illumina fasta;
  • phenotypic DST;

I will also send you an excel file that links the Nanopore barcode to Illumina sample ID. These data are currently in Analysis1 and ResComp. I can put the data in Analysis1 if that works best for you.

14/01/2019

I'm sending the nanopore data for the 52 MGITs (50 TB, 1 abscessus, and 1 avium), which were from Birmingham lab and sequenced in MMM.

  • overall information - spreadsheet "MGITdata" provides sequencing flow cell name, MGIT sample ID, Guuid, illumina results, nanopore sequencing stats, and mapping results
  • nanopore fast5 data - directory in analysis1 listed in spreadsheet "rawDataDirectory"
  • phenotypic DST - directory in analysis1 listed in spreadsheet "rawDataDirectory"
  • illumina fastq - I don't have illumina fastq for these samples, only the Guuid listed in spreadsheet "MGITdata"; ask PHE for assistance?
