cfe-lab / micall Goto Github PK

View Code? Open in Web Editor NEW

14.0 12.0 9.0 28.92 MB

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C

Home Page: https://cfe-lab.github.io/MiCall

License: GNU Affero General Public License v3.0

Python 98.89% HTML 0.59% JavaScript 0.04% Dockerfile 0.23% Singularity 0.26%

bioinformatics fastq python resistance genotype

micall's Introduction

MiCall

Processing FASTQ data from an Illumina MiSeq

Maps all the reads from a sample against a set of reference sequences, then stitches all the reads into consensus sequences and coverage maps.

A monitoring system regularly checks the file system for unprocessed runs, transfers FASTQ.gz files to the cluster and executes the pipeline.

See the list of steps and files for details of what the pipeline does. The admin page describes how to look after the pipeline in Kive, and the getting started page describes how to get the docker version set up and run it on your own data.

Dual Licensing

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, visit gnu.org. The source code for this program is available from github.com.

The program is also available for a fee under a more permissive license. For example, if you want to run a changed version of the program on a network server without publishing the changed source code, contact us about purchasing a license.

Third Party Components

MiCall makes use of several open-source tools. Here is a list of tools with their licenses.

Requests is distributed under the Apache 2.0 license.

Python 3 is distributed under the Python 3 license.

Bowtie2, IVA, and Python-Levenshtein are distributed under the GNU General Public License (GPL).

Matplotlib is distributed under the Matplotlib license.

Reportlab is distributed under the BSD license.

Pyyaml and Cutadapt are distributed under the MIT license.

micall's People

Contributors

Stargazers

Watchers

Forkers

tarah28 poonlab mavino rng64 richardharrigan pastvir

micall's Issues

Gaps in pileup break pileup_to_conseq

I tried to run the pipeline on two fastq files from the 140522 run. The two files I ran were 46824A-3515-HLA-B-E99601CLIMX-PR-RT_S23_L001_R1_001.fastq and 57425Ad20-3515-HLA-B-E99608CLIMX-PR-RT_S12_L001_R1_001.fastq.

I get the following error:

non-zero exit code '1' from command 'python2.7 STEP_1_MAPPING.py ./reference_sequences/cfe /mnt/data/don/data/RAW_DATA/140522/46824A-3515-HLA-B-E99601CLIMX-PR-RT_S23_L001_R1_001.fastq 20 Nextera False 0.95 3 4'

When I look in the mapping log, I see the exception:

Traceback (most recent call last):
  File "STEP_1_MAPPING.py", line 15, in <module>
    miseq_modules.mapping(mapping_ref_path, fastq, consensus_q_cutoff, mode, is_t_primer, min_mapping_efficiency, max_remaps, bowtie_threads)
  File "/home/don/git/MiseqPipeline/miseq_modules.py", line 666, in mapping
    samfile, confile = remap(R1_fastq, R2_fastq, samfile, refpath, original_reference, conseq_qCutoff, num_threads)
  File "/home/don/git/MiseqPipeline/miseq_modules.py", line 497, in remap
    confile = pileup_to_conseq(pileup_file, conseq_qCutoff)
  File "/home/don/git/MiseqPipeline/miseqUtils.py", line 858, in pileup_to_conseq
    token = intermed[0][1]
IndexError: list index out of range

With the help of some debug statements, I tracked it to line 239 of 46824A-3515-HLA-B-E99601CLIMX-PR-RT_S23.HIV1B-env.bam.pileup. There are no reads piled up against position 286, and pileup_to_conseq() assumes there will always be at least one read.

micall: Broken HCV Coverage Plotting

Micall: Some samples have a NS3 gene coverage plot whereas others have just a broken image icon suggesting an issue with the R plotting script. Example would be WD Jan24 HCV run. Sample 51393a has a broken plot image while it's neighbour Sample 51313a produces a plot but no coverage line is plotted, 51378a does have a good plot.

Sort order of runs and users

Runs should be in reverse chronological order so that the most recent run is at the top, and selected by default.
Users should be sorted alphabetically, case insensitive.
Pipeline versions should be sorted in reverse, newest first.

Disabling cross-contamination filter as of version 6.2

Until we can evaluate its effect on results, it's just slowing things down.

Link to MiseqPipeline GitHub

Provide a feedback link on the micall page to create issues in this repository.

Combine coverage maps

Clicking on all the sample/region combinations in a run to see coverage maps isn't practical. Does it make sense to combine all the coverage maps for a region into one so you can see which samples are not well covered? Should we use coverage to assign the table colours instead of simple mapped counts?
Chanson had some ideas for different ways to display coverage.

Change reference sequences to PR, RT, INT, V3

As required by laboratory. Research-based runs will need to use custom reference sequence sets if reads span broader intervals of the genome.

X axis label incorrect on coverage maps

If the region is HCV or HLA, the x axis is still labeled HXB2.
Maybe use "reference amino acid position".
Check that HLA can actually produce coverage maps. WD's run on 25-Apr-2014 didn't seem to have any coverage maps for HLA regions.

Reduce size of coverage map images

We had discussed shrinking them to half the size.
Right now, they are so large that they force a large gap between the count table and the region table. However, it's possible that transposing the region table will leave enough room that we can keep the large coverage map.

Indicate key positions in coverage plots

Sample and region will be failed on the basis of insufficient coverage at key positions of the respective gene (for example, at codon positions where there are known resistance mutations).

nuc files absent from 140522_M01841_0063_000000000-A64FB Results

These files are specifically required for HLA data.
Run was annotated as Nextera mode; csf2nuc required Amplicon mode.
Changed this in development, will push to production.
Manually reprocessed CSF files and posted nuc files to RAW_DATA.

Create test set of FASTQ data for development

Should contain problematic samples.

Empty freqs files for F00061 in A64EA

In run A64EA, for sample F00061, coverage maps are failing to generate due to missing freq files. On deeper inspection, in miseq_modules.csf2counts, qindex_to_refcoord is not being populated, so the amino/nuc freq files are not being created.

v3prot files are created successfully with sensible outputs (This depends on upstream csf files)
F00062-V3LOOP_S30.HIV1B-env.20.csf contains data with normal-looking offsets and sequences (Sample data shown below)

I can't obviously spot problems in the csf data F00062.

sampleSheetParser needs to handle windows line endings

For some reason, some of the SampleSheet.csv's have windows line endings "\r\n" instead of unix line endings "\n".

See /mnt/RAW_DATA/MiSeq/runs/131002_M01841_0026_000000000-A5EPM/SampleSheet.csv for an example. Notepad++ (runs on windows) has an option that will show you the line ending characters.

In order to handle this, we need to replace code that strips newlines for code that strips any whitespace at the end of the line.

EG) miseqUtils.sampleSheetParser()
tag = line.strip('\n').rstrip(',').strip('[]')

should be

tag = line.rstrip().rstrip(',').strip('[]')

Add link to micall page

Should be somewhere in the main QAI menu down the left side.

Add review process

The user can override the pass/fail decision that the pipeline made for each sample/region, as well as for an entire sample or region. Display a check mark or X for manual pass or fail.
The user can also mark the run as reviewed. Confirm if any samples were overridden or if the run has already been reviewed.
Log who reviewed the run and when, along with which samples were overridden.
Create another issue for reviewing after a new version of the pipeline has rerun a run that you reviewed. Should highlight differences with the previous run by showing question marks.

Display an info box when you click on a sample region. The box displays the sample name, the region name, the raw counts, mapped counts, minimum coverage, minimum coverage position.
Add pass / fail radio buttons to the info box that show a cross or check mark.
Use short cut keys: (c)heck, (x)cross, and (z)blank. Codes 67, 88, and 90.
Save to the database when a sample is reviewed.
Add buttons to accept pipeline decisions on a whole row or column.
Add a review button at the bottom, along with a display of who reviewed and when.
Confirm with the user if any samples were overridden.
Show history of pipeline runs and reviews for a sample below the info box.
Create issue for reviewing after a new version of the pipeline has run.

Link to FASTA consensus for HLA regions

Provide a way for a user to see consensus sequences from the HLA regions for any samples with coverage. They should be able to see it in the browser, or download it to a file.
This is similar to issue #73, but HLA regions don't get a consensus sequence of amino acids. We will need to add a new step to build a consensus sequence of nucleotides for the HLA regions.
Turned out this issue was invalid. We were already generating consensus sequences for all regions.

need to generate coverage maps for regions not covered by csf counts

Specifically HLA regions

Fontconfig error when running locally

2014-04-28 13:57:45.589073 - [INFO] python2.7 generate_coverage_plots.py /Volumes/Crawlspace/miseq/140214_M01841_0057_000000000-A64L9/amino_cleaned_frequencies.csv /Volumes/Crawlspace/miseq/140214_M01841_0057_000000000-A64L9/coverage_maps
Fontconfig error: Cannot load default config file
Fontconfig error: Cannot load default config file

bowtie2 crashes with page fault on filter_cross_contaminants

sampleSheetParser fails to parse combined sample names

2014-03-31 15:01:57.857023 - [DEBUG] sampleSheetParser(<open file '/data/miseq/140214_M01841_0057_000000000-A64L9/SampleSheet.csv', mode 'rU' at 0xd934b70>)
2014-03-31 15:01:57.860683 - [ERROR] 50940ARPT-HCV-49541A-V-PR-RT_S14 not in SampleSheet.csv - cannot initiate mapping for this sample
2014-03-31 15:01:57.861117 - [ERROR] 50931A-HCV-49514A-V-PR-RT_S20 not in SampleSheet.csv - cannot initiate mapping for this sample
2014-03-31 15:01:57.861363 - [ERROR] 50966A-HCV-49532A-V-PR-RT_S9 not in SampleSheet.csv - cannot initiate mapping for this sample

Blank coverage maps

Some coverage maps don't show any counts, even though the mapped counts are not zero.

For example, user AG's run from 2 Oct 2013, doesn't have any good coverage maps that I could find. I tried clicking on the first five samples in the HIV1B-env region, and they were all blank.

Run 140205 eats up all RAM, swap on head node

This seems to affect this particular run.
For now, skipping this run in production (version 6) re-processing.

2014-05-15 19:07:33.576896 - [INFO] pID 65792: bpsh 0 python2.7 STEP_2_SAM2CSF.py /data/miseq/140205_M01841_0054_000000000-A64DU/F00997-V3LOOP_S3.HIV1B-env.remap.sam 15 10 Amplicon 0.5
2014-05-16 00:56:52.707074 - pid 65746 returned non-zero exit code '255' from command 'python2.7 STEP_2_SAM2CSF.py /data/miseq/140205_M01841_0054_000000000-A64DU/F01446-V3LOOP_S12.HIV1B-env.remap.sam 0 10 Amplicon 0.5'
2014-05-16 05:26:59.331704 - pid 65748 returned non-zero exit code '255' from command 'python2.7 STEP_2_SAM2CSF.py /data/miseq/140205_M01841_0054_000000000-A64DU/F01446-V3LOOP_S12.HIV1B-env.remap.sam 10 10 Amplicon 0.5'

Add entry for all users

Should be first in the list and the default. That way the latest run will appear by default.

Review colour scheme for region table

Look in the sample sheet to see which regions are expected for each sample. Use colours for expected regions, and grey scale for unexpected regions.
Choose different threshold counts for the colour scale in each organism.

Purge most records in the MISEQQC_QUALITYMETRICS table

Here's a summary of the e-mail discussion:

MISEQQC_QUALITYMETRICS currently has 40,611,200 rows. The table has the following columns:

runid

lane

tile

cycle

Q_bin

numclusters

Chanson is 99% sure this means:

Q_bin: quality score

NumClusters: Number of bases observed in this run/lane/tile/cycle with
a quality score == Q_bin

I don't see why there's a need for 50 Q_bins as we've only ever observed
quality scores between 12-40. So if we only retain Q_bins where NumClusters>0, we're left with only 57% of the data
If we then retain only cycles 50, 260, 500 (260 instead of 250 since it
should fall in the Index reads) we'd then only be retaining <0.5% of the
data. ~4000 rows/run is manageable, right?
Richard said that 4000 per run is certainly manageable; as is 10,000 (which would mean keeping ALL the numcluster values. So let's just eliminate all values where cycles are <> 50, 260 or 500 and declare victory (for now)

noalign*.fastq files are eating up disk space

These get generated by remap() where all reads that failed to map to a region-specific reference get tossed into a new FASTQ file. This generates a lot of redundant garbage where 99% of the raw data map to one region - other regions with minimal representation in the sample will create noalign files that are roughly equivalent to the original FASTQ.
Erased all noalign.fastq files on the cluster to free up nearly half of disk space.
Changed bowtie2 flags in remap() so that these files are no longer being generated - push to dev branch, assess for next production version.

Use arrow keys to navigate between regions

Update the selected region by using the arrow keys. Experiment to see if that's easy to implement. Should it skip grey squares?

Shorten hover delay

Use Javascript to display counts and sample names immediately when you hover over a table cell.

Click for cross reference

Provide some way to cross reference between the two tables. Clicking on a count highlights the corresponding row in the region table?

No g2p results in production

Forgot to compile alignment.c using build_alignment.sh

Clear page before sending Ajax requests

When the response is slow or there's an error, you need some feedback that something happened. Either clear the tables, or display a "Loading..." message.

amino acid frequencies

Run 140214 amino freq file data doesn't look right.

Y:\MiSeq\runs\140214_M01841_0057_000000000-A64L9\Results\version_6.1

Need amino freq for sample 49537A-INT. This is a HIV integrase sample.

HCV AA Freq problem

Hi,

While evaluating HCV runs for polymorphism at AA site 80 in the NS3 gene i found a troubling issue. As the quality score increases above 15 the AA freq goes from being realistic (most all data at Q or K) to unrealistic (seemingly evenly dispersed across all AA at position 80).

Example:
Lines 105-107 are realistic. While lines 108-111 do not seem to be. Likely a result of NA creeping into the data as Q scores increase thus spreading the data across AA?

105 51204A-D1-HCV_S1.HCV1A-H77-NS3.0.amino.freqs 79 80 1 0 0 3 0 0 0 2 2387 0 1 6 0 14 3 0 0 0 0 0 0 2417
106 51204A-D1-HCV_S1.HCV1A-H77-NS3.10.amino.freqs 79 80 1 0 0 3 0 0 0 2 2387 0 1 6 0 14 3 0 0 0 0 0 0 2417
107 51204A-D1-HCV_S1.HCV1A-H77-NS3.15.amino.freqs 79 80 1 0 0 3 0 0 0 1 2262 0 1 3 0 9 2 0 0 0 0 0 0 2282
108 51204A-D1-HCV_S1.HCV1A-H77-NS3.20.amino.freqs 668 80 100 17 161 36 31 76 154 69 30 129 3 14 84 105 93 136 68 56 2 5 14 1383
109 51204A-D1-HCV_S1.HCV1A-H77-NS3.25.amino.freqs 668 80 100 17 158 34 31 72 152 68 30 125 3 14 83 102 88 135 66 54 2 5 13 1352
111 51204A-D1-HCV_S1.HCV1A-H77-NS3.35.amino.freqs 735 80 71 20 26 17 11 28 5 28 55 83 32 5 60 88 251 184 55 57 7 16 11 1110

Tighten match criteria for HLA

When mapping to HLA, do not iterate to build a consensus sequence, just map to the reference sequence.

Coverage map should not overlap tables

Add a spacer so the browser will add a scroll bar before the coverage map overlaps the tables.

coverage_plots.R not functioning on cluster

[art@Bulbasaur development]$ Rscript coverage_plots.R /data/miseq/140522_M01841_0063_000000000-A64FB/amino_frequencies.csv /data/miseq/140522_M01841_0063_000000000-A64FB/coverage_maps/
Error in plot.window(...) : need finite 'xlim' values
Calls: plot -> plot.default -> localWindow -> plot.window
In addition: Warning message:
In max(df$refseq.aa.pos) : no non-missing arguments to max; returning -Inf
Execution halted

Transpose Region Table

Use the regions as columns and the samples as rows so the table won't be so wide.

Counts are way too low on 22 May 2014 run

Winnie was expecting to see HLA-B on all samples except N713 column.
The samples with long names should have both HLA-B and several regions of HIV, however they seem to have not mapped anything.
The raw counts seem really low.

we really need a meta-progress monitor

so that when the lab runs the MiSeq, that run is queued and we know how long it has been unprocessed (until the pipeline generates results).

Support Firefox

Just need to add another CSS style to rotate the column headers.

g2p scoring appears to be broken

2014-05-29 16:23:15.811628 - [INFO] Reading /Volumes/Crawlspace/miseq/131002_M01841_0026_000000000-A5EPM/F00021-V3LOOP-2_S82.HIV1B-env.0.v3prot
2014-05-29 16:23:15.811970 - [WARNING] miseqUtils.prop_x4() threw exception 'convert_fasta threw exception 'local variable 'h' referenced before assignment''

.nuc file and amino acid/nucleotide/indel counts wrong when mates do not overlap

Seen in Commit 0278781:

When mates do not overlap, the space in between them is filled with '-' gap characters in miseqUtils.merge_pairs().

However, when the merged sequence is written to .csf file in miseq_modules.sam2csf_with_base_censoring(), all '-' gap characters are removed.

This becomes problematic when the .csf files are parsed in miseq_modules.csf2counts() to obtain metrics such as amino acid counts by position, nucleotide counts by position, indel counts by position. The 2nd mate is shifted towards the left in position, which screws up the counts by position. It also screws up the amino acid translation for the 2nd mate.

This also removes the space between mates for the .nuc fasta files in miseq_modules.csf2nuc()

Possible fix: in miseqUtils.merge_pairs(), do not use '-' gap characters in between mates. Instead use 'N' characters or some special character designated for the space in between mates. Using 'N' characters will cause the max_prop_n threshold to fail more often for non-overlapping mate-pair reads.

No results generated for HLA samples

Because we have no HLA-A, -B, or -C references in csf2counts
This is tricky because there is no meaningful reading frame - the region being sequenced comprises two exons separated by an intron.

bug in miseq_modules.csf2nuc()

2014-06-03 16:32:53.621166 - [INFO] pid 91940: miseq_modules.csf2nuc(/data/miseq/140522_M01841_0063_000000000-A64FB/57322A-1015-HLA-B_S40.HLA-C.0.csf,/usr/local/share/miseq/production/reference_sequences/csf_to_fasta_by_nucref.csv)
Traceback (most recent call last):
  File "STEP_5_CSF2NUC.py", line 9, in <module>
    miseq_modules.csf2nuc(file, ref_path)
  File "/usr/local/share/miseq/production/miseq_modules.py", line 138, in csf2nuc
    csf = parse_csf(f, mode)
NameError: global name 'mode' is not defined

Version 6.2 breaks micall Q-cutoff dropdown

I changed the filename convention for PNG files, so they no longer include the Q cutoff, e.g., 46460A-3515-HLA-B-E778109KB-PR-RT_S9.HIV1B-gag.png

Grid lines on region table

It's hard to follow from a sample name or region name to a count square. Try adding grid lines or using a light grey colour on the empty squares.

Nucleotide frequency files don't make sense

See 131002_M01841_0026_000000000-A5EPM nucleotide_frequencies.csv.
amino_frequencies.csv looks OK.

Colour of coverage maps is inconsistent

Some coverage maps have blue shading beneath the curve, but it's not consistent.
For example, user RM's run from 15 Apr 2014 has blue shading for all the regions of its first sample, but user PW's run from 12 Feb 2014 does not. I looked at pipeline version 6.1 for both runs.

micall: HCV Coverage Plots

Hi,

One more somewhat minor issue with the R plots. The x-axis for the HCV plots is labeled - HXB2 amino acid position. Probably should be HCV H77 reference position?

Cheerios!