mbhall88 / head_to_head_pipeline

Snakemake pipelines to run the analysis for the Illumina vs. Nanopore comparison.

License: GNU General Public License v3.0

Python 0.95% Shell 0.02% Jupyter Notebook 55.14% Perl 0.01% HTML 43.88% Jinja 0.01%
tuberculosis snakemake pipeline bioinformatics nanopore illumina drug-resistance transmission clustering diagnostics

head_to_head_pipeline's Introduction

Paper

Hall, M. B. et al. Evaluation of Nanopore sequencing for Mycobacterium tuberculosis drug susceptibility testing and outbreak investigation: a genomic analysis. The Lancet Microbe (2022). doi: 10.1016/S2666-5247(22)00301-9.


This repository holds the pipelines/scripts used for our paper analysing Illumina and Nanopore sequencing for M. tuberculosis drug resistance calling and transmission clustering.

If you want to analyse your Nanopore data in the same manner as in this paper, we suggest using https://github.com/mbhall88/tbpore, a Python program that runs the drug resistance prediction and clustering components of this pipeline (with a smaller decontamination database). It is actively maintained and much easier to use.

All pipelines require a common set of dependencies to be installed.

See the subdirectories for more specific information about each pipeline. They are nested according to their dependence on the outputs of other pipelines.

Some pipelines in this repository are not relevant to the work in the final paper.

Data availability

All data are submitted under project accession PRJEB49093.

The accessions and all relevant sample metadata for this study can be found at https://doi.org/10.6084/m9.figshare.19304648.

The raw Nanopore data is available to download from: https://ftp.ebi.ac.uk/pub/databases/ont_tb_eval2022/. See the sample metadata file for mappings between samples and the relevant Nanopore runs and barcode numbers.


head_to_head_pipeline's Issues

Decontamination of reads

Following the process that clockwork uses for decontamination seems the best approach.

Shall we set an arbitrary threshold, N, where if a sample has >N% NTM content, we exclude the sample from the project entirely?

At the end of this process, we have a fastq for each sample containing only "good quality" reads (i.e. not NTM/human). This fastq will be used downstream for all other aspects of the paper (except maybe DST).
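
Below is a minimal sketch of the proposed exclusion rule, assuming a hypothetical per-sample composition table (columns: sample, ntm_percent) produced by the decontamination step; the threshold value is purely illustrative.

import csv

NTM_THRESHOLD = 5.0  # the arbitrary N% discussed above; illustrative only

def samples_to_exclude(composition_tsv: str) -> list[str]:
    """Return samples whose NTM content exceeds the threshold."""
    excluded = []
    with open(composition_tsv) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if float(row["ntm_percent"]) > NTM_THRESHOLD:
                excluded.append(row["sample"])
    return excluded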

Baseline variant analysis

The truth set of variants for the Illumina data will come from the PHE variant calling pipeline, COMPASS.
As a baseline for the Nanopore data, we will run bcftools, with some filtering of variants to remove low-quality calls and a mask to avoid repetitive and structurally variable regions of the Mtb genome.

The results for this subsection will show concordance of the basic Nanopore variant calling with the Illumina truth set. We will discuss how the different filters trialled affected the results and plot recall against concordance.
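
As an illustration of the mask described above, here is a sketch (assuming pysam is available and the mask is a single-chromosome BED file of repetitive/structurally variable regions) that drops calls falling inside masked intervals:

import pysam

def load_mask(bed_path: str) -> list[tuple[int, int]]:
    """Load mask intervals (0-based, half-open) for a single chromosome."""
    intervals = []
    with open(bed_path) as fh:
        for line in fh:
            _, start, end = line.split()[:3]
            intervals.append((int(start), int(end)))
    return intervals

def write_unmasked(vcf_in: str, bed_path: str, vcf_out: str) -> None:
    """Drop records whose position falls inside a masked interval."""
    mask = load_mask(bed_path)
    with pysam.VariantFile(vcf_in) as vcf:
        with pysam.VariantFile(vcf_out, "w", header=vcf.header) as out:
            for rec in vcf:
                # rec.start is the 0-based position of the record
                if not any(s <= rec.start < e for s, e in mask):
                    out.write(rec)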

Variant caller to use for baseline analysis

We initially said we would use clockwork to do the baseline Illumina variant calling. But I raised the point that COMPASS may actually be better for this paper, as it is the current "standard"; clockwork is not yet published or in production.

Excluding samples based on coverage

Do we want to exclude samples based on coverage? If so, what threshold should we set?

The coverage for the Nanopore data can be seen on a per-sample basis for the Malagasy and South African data in the figures below.

TODO

  • add figures
  • get coverage information for Illumina data
  • provide stats on coverage at a project-wide level

Project-wide genome mask

Do we use

  • Oxford mask
  • Maha (Max Marin) mask

I prefer Max's mask for various reasons (mostly the transparency of how it was created). But, not sure if you are hesitant to move away from the Oxford one @iqbal-lab?

If we do want to use Max's one I guess we probably need to run it by them first?

Demultiplex Birmingham nanopore data

  • tb_birmingham_4
  • mgit_phe_1
  • mgit_phe_2
  • birmingham-mgit-run3
  • birmingham-mgit-run4
  • birmingham-mgit-run5
  • birmingham-mgit-run6
  • birmingham-mgit-run7

This script was used to demultiplex all data in the following way.

demux -i guppy_v3.4.5/fastqs -o guppy_v3.4.5/demux -z --trim -t 8

This assumes the default guppy version (3.4.5) and the default barcode kit (EXP-NBD103).

Call per-sample variants with pandora

Tasks that fall under this issue:

  • pandora map looking for de novo variants
  • update MSAs with discovered sequences
  • turn updated MSAs into PRGs (make_prg)
  • index new PRG
  • pandora map not looking for de novo variants

Normalise pandora map VCFs

Here we want to make the VCFs output from #46 interoperable with the COMPASS and bcftools VCFs, which are based on H37Rv. The main task is adjusting the POS column to the corresponding position on H37Rv, ensuring information about the local PRG name and coordinates is retained in the INFO field.
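
A minimal sketch of the coordinate adjustment, assuming a hypothetical offsets table mapping each local PRG name to its 1-based start position on H37Rv (header declarations for the new INFO keys are omitted for brevity):

def lift_to_h37rv(vcf_in: str, vcf_out: str, offsets: dict[str, int]) -> None:
    """Rewrite CHROM/POS onto H37Rv, keeping local coordinates in INFO."""
    with open(vcf_in) as fin, open(vcf_out, "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, info = fields[0], int(fields[1]), fields[7]
            fields[0] = "NC_000962.3"  # H37Rv; contig name is illustrative
            fields[1] = str(offsets[chrom] + pos - 1)
            # retain provenance in INFO as proposed above
            extra = f"LOCAL_PRG={chrom};LOCAL_POS={pos}"
            fields[7] = extra if info == "." else f"{info};{extra}"
            fout.write("\t".join(fields) + "\n")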

Apply variants to H37Rv reference chunks

For the dense and sparse H37Rv PRGs [#3] we need to add a certain number of variants to our PRG. After we have ranked the variants with FORGe [#2], we can subset this ranking into dense and sparse sets.
Then, for every variant in the ranking, we take the reference sequence for the "chunk" [#9] containing that position and apply the variant to it. We then add the mutated sequence to a fasta file for that chunk.
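
A sketch of applying a single variant to a chunk, assuming 1-based variant positions on H37Rv and that the chunk's start coordinate is known:

def apply_variant(chunk_start: int, chunk_seq: str, pos: int, ref: str, alt: str) -> str:
    """Return the chunk sequence with the variant applied."""
    i = pos - chunk_start  # 0-based index of the variant within the chunk
    assert chunk_seq[i : i + len(ref)] == ref, "REF does not match the chunk"
    return chunk_seq[:i] + alt + chunk_seq[i + len(ref):]

# e.g. a C->T SNP at position 102 in a chunk starting at position 100
mutated = apply_variant(100, "GACTGA", 102, "C", "T")  # -> "GATTGA"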

Clarify usage of pandora compare

This issue is for discussing the role of pandora compare in this paper (if it does play a role?).

@iqbal-lab will we also run the samples through compare? If so, what questions, specifically, are we trying to answer with it? Is it the same as the per-sample analysis, but checking whether compare gives better results?

Basic analysis of per-sample variant calls

The aim of this subsection will be to try varying degrees of PRG complexity for Mtb sample analysis. Refer to #3 for details on construction.

For each of the PRGs, we plan to perform the following analysis: quantify the number of SNPs and indels pandora calls per sample, on average, and see how this compares to the baseline [#4] and the truth. We will report the concordance rate, i.e. the proportion of shared sites between pandora and the truth that agree.

Lineage-specific VCFs

Split the large CRyPTIC VCF, with all samples in it, into lineage-specific VCFs to make variant ranking easier [#2].

Drug susceptibility prediction

Validate concordance with mykrobe for Illumina and Nanopore. In particular, the analysis will focus on the (hopefully) lower coverage required by pandora to achieve the same, or better, results as mykrobe, and the increased detection power provided by de novo variant discovery.

Produce QC report

At the end of the QC pipeline, a report with the following information would be useful:

  • Fraction of reads mapped to contaminants (per-sample)
  • Fraction of reads that did not map to the decontamination database (per-sample)
  • krona plots showing sample composition (per-sample)
  • Coverage plots for final fastq
  • Lineage information of samples

Convert single-fast5 to multi-fast5

  • tb_birmingham_4
  • mgit_phe_1
  • mgit_phe_2
  • birmingham-mgit-run1
  • birmingham-mgit-run2
  • birmingham-mgit-run3
  • birmingham-mgit-run4
  • birmingham-mgit-run5
  • birmingham-mgit-run6
  • birmingham-mgit-run7

Construct PRGs

We plan to try out a variety of PRGs in this project

  • Linear PRG - This will just be the standard M. tuberculosis reference H37Rv
  • Sparse PRG - H37Rv and all variants at frequency above 10%, from CRyPTIC
  • Dense PRG - The same as Sparse PRG, but with variants at frequency above 1%.
  • Representative PRG - This will contain two high-quality genomes from each lineage, plus all variants from the above dataset at >=5% frequency.

Concordance of baseline variant callers

We will look at two different kinds of concordance (a sketch of both follows the list):

  1. At positions where COMPASS says ALT, how often does bcftools on Nanopore agree?
  2. At all positions where COMPASS "makes a statement", how often does bcftools on Nanopore agree?
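
A sketch of both definitions, assuming each call set is represented as a dict mapping position to a call ("REF", an ALT allele, or None where the caller makes no statement); excluding positions where bcftools makes no call is one possible choice, not a settled one:

def concordance(compass: dict[int, str], nanopore: dict[int, str]) -> tuple[float, float]:
    """Return (ALT concordance, all-statement concordance)."""
    # type 1: COMPASS ALT positions where bcftools also makes a statement
    alt_shared = [p for p, c in compass.items()
                  if c not in (None, "REF") and nanopore.get(p) is not None]
    alt_agree = sum(compass[p] == nanopore[p] for p in alt_shared)
    # type 2: all positions where COMPASS makes any statement
    stated = [p for p, c in compass.items()
              if c is not None and nanopore.get(p) is not None]
    agree = sum(compass[p] == nanopore[p] for p in stated)
    return alt_agree / len(alt_shared), agree / len(stated)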

Use FORGe to rank variants

FORGe is a tool for ranking variants and building an optimal graph genome. We are specifically interested in using it to help us select the variants we want to use for constructing our various PRGs.

SNP distance between Illumina and Nanopore calls

One of the main figures for the paper will show a scatter plot of pairwise SNP distances. The two axes will be the Illumina SNP distance and the Nanopore SNP distance for each pair of samples. If clustering with Nanopore is interchangeable with Illumina clustering, we would hope to see a perfect diagonal line from the bottom left to the top right, indicating the distance between samples is the same for both technologies' SNP calls. Failing this, we would hope for a linear relationship of some sort, which would allow us to provide a recommendation for SNP thresholds when using Nanopore.
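
A sketch of this figure, assuming the paired pairwise distances (e.g. from snp-dists) have already been flattened into two aligned lists:

import matplotlib.pyplot as plt

def plot_distances(illumina: list[int], nanopore: list[int], out: str) -> None:
    """Scatter Illumina vs. Nanopore pairwise SNP distances."""
    fig, ax = plt.subplots()
    ax.scatter(illumina, nanopore, s=8, alpha=0.5)
    lim = max(max(illumina), max(nanopore))
    ax.plot([0, lim], [0, lim], ls="--", c="grey", label="y = x")  # perfect agreement
    ax.set_xlabel("Illumina pairwise SNP distance")
    ax.set_ylabel("Nanopore pairwise SNP distance")
    ax.legend()
    fig.savefig(out)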

Basecall Birmingham nanopore data

  • tb_birmingham_4
  • mgit_phe_1
  • mgit_phe_2
  • birmingham-mgit-run3
  • birmingham-mgit-run4
  • birmingham-mgit-run5
  • birmingham-mgit-run6
  • birmingham-mgit-run7

All data are basecalled using this script with default parameters, i.e. guppy version 3.4.5 and the basecalling model dna_r9.4.1_450bps_hac_prom.cfg.

Investigate concordance outliers

[Figure: call rate vs. concordance (concordance-outliers)]

From this plot of call rate against concordance, we can see there are 5 samples that are clear outliers when it comes to concordance. These 5 samples are:

  • mada_2-46
  • R27006
  • R13303
  • mada_1-34
  • mada_1-50

I looked at a random selection of VCF positions where these samples made incorrect calls and could not see any discernible metric causing this low concordance.

For the following plots I extracted all positions (for each of the 5 samples) where bcftools made an incorrect (false) call and compared those to positions where it made a correct (true) call.

Strand bias


There does not appear to be any enrichment for strand bias in the false calls.

VCF QUAL score


Note: the false calls have a higher QUAL because most did not have an ALT, so the QUAL calculation is slightly different and yields a higher score. See here for elaboration on how the calculation differs.
The main takeaway is that if we were to raise the minimum QUAL filtering cutoff to 30, we would remove the smaller peak in the false calls. This would, of course, be at the expense of a drop in call rate. @iqbal-lab do you think this is an acceptable change to the filtering?
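
To gauge the cost of the change, a sketch (assuming plain, uncompressed VCFs) that counts how many calls a given minimum QUAL cutoff would remove:

def count_below_qual(vcf_path: str, min_qual: float = 30.0) -> tuple[int, int]:
    """Return (calls below the QUAL threshold, total calls)."""
    below = total = 0
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            total += 1
            qual = line.split("\t")[5]  # the QUAL column
            if qual != "." and float(qual) < min_qual:
                below += 1
    return below, total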

Median depth


Strangely, the false calls seem to have higher depth. I'm not sure there is much that can be done here.


Conclusion

There does not appear to be any simple fix with respect to filters that can explain the low concordance of these samples.

Next thing

Plot the concordance for each of these samples against all other samples to see if they may be sample swaps. For instance, if mada_2-46 has better concordance with a different sample, it has quite likely been swapped and we may need to exclude these samples.

Assign lineages

Assign lineages to the samples in the ~15,000-sample CRyPTIC VCF. This is important to allow us to split the VCF into lineage-specific VCFs, ensuring we don't drown out the under-represented lineages when we build the PRGs [#3].

Evaluate basecalling model

(Proposed) Methods for evaluating the Mtb-specific guppy basecalling model.

1

Create a file mapping read IDs (for the evaluation reads) to the sample they come from.

2

Split the evaluation reads for each basecalling model into per-sample fastq files. This will result in two fastq files for each sample.

Note: there is an inherent assumption here that the barcode assignment from the original guppy model is to be taken as "truth".
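
A sketch of this splitting step, assuming the read ID to sample mapping from step 1 is in memory and the evaluation reads for one model are in a single (uncompressed) fastq:

from pathlib import Path

def split_by_sample(fastq: str, read_to_sample: dict[str, str], outdir: str) -> None:
    """Write each read to <outdir>/<sample>.fastq based on its read ID."""
    Path(outdir).mkdir(exist_ok=True)
    handles = {}
    with open(fastq) as fh:
        while (header := fh.readline()):
            # a fastq record is four lines: header, sequence, '+', qualities
            record = header + fh.readline() + fh.readline() + fh.readline()
            read_id = header.split()[0].lstrip("@")
            sample = read_to_sample.get(read_id)
            if sample is None:
                continue  # read is not in the evaluation set
            if sample not in handles:
                handles[sample] = open(Path(outdir) / f"{sample}.fastq", "w")
            handles[sample].write(record)
    for h in handles.values():
        h.close()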

3

Map each fastq to the reference assembly for that sample, using minimap2 (output PAF).
Non-standard flags to use:

  • -c generate CIGAR
  • -2 use two IO threads
  • --secondary=no do not output secondary alignments

Q: Are we happy to ignore unmapped, secondary, and supplementary alignments? I think we should, but better to be explicit with this decision.

4

Calculate the 'BLAST identity' for each alignment. In PAF format, this is achieved by dividing column 10 (number of matching bases in the mapping) by column 11 (number of bases, including gaps, in the mapping). BLAST identity is more suitable than gap-compressed identity for this assessment because we want the length of indels to affect the identity metric.
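
A sketch of the calculation, assuming minimap2 was run with -c (from step 3) so the match count is CIGAR-based:

def blast_identities(paf_path: str) -> list[float]:
    """BLAST identity per alignment: PAF column 10 / column 11."""
    identities = []
    with open(paf_path) as fh:
        for line in fh:
            fields = line.split("\t")
            matches, aln_len = int(fields[9]), int(fields[10])  # columns 10 and 11
            if aln_len > 0:
                identities.append(matches / aln_len)
    return identities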

5

Generate assemblies with Rebaler. Ryan specifically created Rebaler for assessing consensus accuracy when comparing basecalling models. It is effectively reference-guided assembly with some Racon polishing thrown in, i.e. it uses the structure of the original assembly and replaces the bases with those from the reads.

6

Generate plots. The idea is to aggregate all data and plot for all samples together as we are assessing the model rather than the samples.


@iqbal-lab

Outstanding questions

  • Should we also produce a plot of sequence errors at the read level? If so, would the best way to gather this be from CIGAR strings?
  • Are we happy to ignore non-primary alignments for evaluation?
  • How best to visualise BLAST identity vs read length?
  • How best to visualise consensus accuracy?

As mentioned at the beginning, this is a proposal. Discuss anything else you think is missing or not needed.

Subsample PRG "founder" variants

[continuation of #2]
We will no longer use FORGe to select variants for our PRG construction [#3]. Instead, we will randomly subsample from the lineage-specific VCFs [#11] for lineages with >100 samples. If there are fewer than 100 samples in a VCF, we include all variants from that VCF.
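
A sketch of this rule, assuming plain-text lineage VCFs and that the number of samples in each lineage is known; k, the number of variants to keep for large lineages, is a parameter we have not yet fixed:

import random

def subsample_records(vcf_path: str, n_samples_in_lineage: int, k: int) -> list[str]:
    """Keep k random records if the lineage has >100 samples, else all."""
    with open(vcf_path) as fh:
        records = [line for line in fh if not line.startswith("#")]
    if n_samples_in_lineage <= 100:
        return records
    return random.sample(records, min(k, len(records)))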

Generate consensus sequence from variant calls

In order to create a SNP distance matrix using snp-dists we need fasta sequences for each sample's variant calls in a multi-fasta file.

bcftools consensus does not quite do what we need: it does not adhere to FILTER, and its behaviour at missing/null sites may not be what we want.

@iqbal-lab I know I sound like a broken record, but what base shall we produce in each of the following circumstances? (A sketch of one possible decision logic follows the list.)

  • position is missing in VCF (bcftools outputs the reference position)
  • position is called NULL/. (bcftools outputs the reference position unless you use the --missing parameter)
  • position fails FILTER
  • position is in mask (bcftools outputs N)
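
A sketch of one possible decision logic for a single position; the choices for each case (N for masked, failed-filter, and null positions; the reference base where the VCF is silent) are assumptions for discussion, not decisions:

def consensus_base(ref_base: str, call, filter_pass: bool, in_mask: bool) -> str:
    """Pick the consensus base for one position; call is None if absent from the VCF."""
    if in_mask:
        return "N"       # position is in the mask
    if call is None:
        return ref_base  # position missing from the VCF
    if not filter_pass or call == ".":
        return "N"       # failed FILTER, or NULL call
    return call          # the called allele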

Add Birmingham data to samplesheet

Reminder: samples with no Illumina data

17_616007 illumina data not found
17_613843 illumina data not found
17_616229 illumina data not found
17_616205 illumina data not found
17_616182 illumina data not found
17_619748 illumina data not found
17_619267 illumina data not found
17_609069_c2 illumina data not found
18_620949 illumina data not found

Filter pandora map VCFs

There are two types of filtered VCF that we want to produce here:

  1. Remove "large" things. Effectively we want to restrict to indels of a certain size.
  2. Remove non-SNPs, as this VCF will be used in the concordance/distance comparison with COMPASS and bcftools.

Question for @iqbal-lab: what size indels do we want to restrict to? (A sketch of both filters follows.)
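
A sketch of both filters, assuming plain-text VCF records; MAX_INDEL stands in for the undecided size threshold:

MAX_INDEL = 20  # illustrative only; the real cutoff is still to be decided

def keep_record(line: str, snps_only: bool) -> bool:
    """Filter 1 (size) when snps_only is False; filter 2 (SNPs) when True."""
    fields = line.split("\t")
    ref, alts = fields[3], fields[4].split(",")
    if snps_only:
        return len(ref) == 1 and all(len(a) == 1 for a in alts)
    return all(abs(len(a) - len(ref)) <= MAX_INDEL for a in alts)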

Reproducing "truth" Illumina clusters using genome graphs and nanopore

In this section, we will calculate the pairwise SNP distance between all pairs of samples, using the truth set from the baseline variant calling [#4]. We will then do the same using the pandora variant calls [#6] from whichever PRG we decide provides the best results.
The main figure for this subsection will be a dot plot where each dot is a pair and the X- and Y-axes are the pairwise distance for that pair according to the truth set and pandora, respectively. In a perfect world, we would expect to see a dead-straight diagonal line from the bottom-left to top-right. In reality, we hope to see a general linear trend. A linear trend would allow us to state that although specific SNP distance thresholds used to define clusters differ between technologies, they are relative and therefore one can use Nanopore data with pandora to generate epidemiological clusters for Mtb.

Quality control

This epic will be used for aggregating issues/tasks relating to the quality control of samples prior to everything else we do in this project.

Maximum coverage threshold for samples

We should probably set a maximum coverage threshold for samples. A sample with coverage above this threshold will be subsampled down to the threshold. This would be the last step in the QC pipeline, to ensure we are subsampling only relevant reads.

100x seems reasonable for Nanopore; what about Illumina?
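
A sketch of how the cap could work, assuming the H37Rv genome length and reads taken at random until the base budget is spent (a dedicated tool such as rasusa does this properly):

import random

MTB_GENOME_SIZE = 4_411_532  # H37Rv length in bp
MAX_COVERAGE = 100

def subsample_to_coverage(read_lengths: list[int]) -> list[int]:
    """Return indices of reads to keep so total bases <= 100x the genome."""
    budget = MTB_GENOME_SIZE * MAX_COVERAGE
    order = list(range(len(read_lengths)))
    random.shuffle(order)
    kept, total = [], 0
    for i in order:
        if total + read_lengths[i] > budget:
            continue  # skip reads that would overshoot; smaller ones may still fit
        kept.append(i)
        total += read_lengths[i]
    return kept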

Assign lineages for project samples

This information will no doubt be useful at some point.

For now, probably easiest to use the lineage assignment script from the PRG building and the VCFs from COMPASS.

Organise PHE Birmingham data

Emails from Yifei about this data:

12/11/2018

I have 3 pieces of data for 52 MGITs (50 TB, 1 abscessus, and 1 avium) from Birmingham:

  • Nanopore fastq;
  • Illumina fasta;
  • phenotypic DST;

I will also send you an excel file that links the Nanopore barcode to Illumina sample ID. These data are currently in Analysis1 and ResComp. I can put the data in Analysis1 if that works best for you.

14/01/2019

I'm sending the nanopore data for the 52 MGITs (50 TB, 1 abscessus, and 1 avium), which were from Birmingham lab and sequenced in MMM.

  • overall information - spreadsheet "MGITdata" provides sequencing flow cell name, MGIT sample ID, Guuid, illumina results, nanopore sequencing stats, and mapping results
  • nanopore fast5 data - directory in analysis1 listed in spreadsheet "rawDataDirectory"
  • phenotypic DST - directory in analysis1 listed in spreadsheet "rawDataDirectory"
  • illumina fastq - I don't have illumina fastq for these samples, only the Guuid listed in spreadsheet "MGITdata"; ask PHE for assistance?
