veg / flea-pipeline

A pipeline for long-read sequencing data.

License: Other

Python 54.55% HyPhy 19.40% Julia 2.20% Groovy 22.37% Nextflow 1.48%

flea-pipeline's Issues

Finalize data file formats.

The current nesting in files like sequences.json and trees.json assumes longitudinal sampling. Switch to a flat, general format.
Replace frequencies.json with coordinates.json.
Make a run overview json file.

Shift correction should introduce Ns, not Xs.

BioPython doesn't like some codons with X in them. For instance, translating "XAA" fails but "NAA" gives "X".
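
A minimal sketch of the substitution, assuming shift correction produces plain nucleotide strings. The helper name `mask_shift_positions` is hypothetical and not part of the pipeline.

```python
# Hypothetical helper: swap the placeholder X for the IUPAC ambiguity code N,
# so that downstream translation sees a valid nucleotide symbol.
_X_TO_N = str.maketrans({"X": "N", "x": "n"})


def mask_shift_positions(seq: str) -> str:
    """Return seq with every X/x replaced by N/n; other characters are kept."""
    return seq.translate(_X_TO_N)
```

Applied to "XAA" this yields "NAA", which is the form the issue reports BioPython can translate.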

obtainEvolutionaryHistory.bf segfaults

...if a timepoint has fewer than 3 consensus sequences.

In 6479e2e I added an exception if this condition occurs, to make the problem more obvious. Really, we should fix the script, though.

Parse logfile and get task running times

parse file
save task runtimes. boxplot or similar visualization
save task start/stop times. plot time line.

HyPhy and fasttree subpipelines do not play well together.

Their input files are not correct

Quantify and summarize results of shift correction

number of changes per sequence
location of the changes
change in diversity of set of sequences

HyPhy scripts rely on parsing copynumber from name

...but now we use a separate file for copynumbers.

Analysis-only pipeline needs copynumber file

If the user provides only an alignment, the pipeline fails; a copynumber file is also required. Add an option to assume a copynumber of 1 for each sequence.
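
One way the fallback could look, as a sketch only. It assumes the alignment is FASTA and that copynumbers, when present, are available as a mapping from sequence id to count. The function names are hypothetical.

```python
def fasta_ids(lines):
    """Collect record ids (the first word after '>') from FASTA lines."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def copynumbers_for(ids, provided=None):
    """Map each id to its copynumber, assuming 1 where none is provided."""
    provided = provided or {}
    return {seq_id: int(provided.get(seq_id, 1)) for seq_id in ids}
```

Calling `copynumbers_for(ids)` with no second argument gives every sequence a copynumber of 1.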

copynumber task not waiting for its dependencies

Even though it has three dependencies, sometimes compute_copynumber does not wait for them to complete.

Upgrade dependencies in virtual environment.

Newest version of pysam fails to build in the FLEA virtual environment
flea-env's copy of bam2msa.py does not work. Temporarily using system version.
submit pull request for our version of BioExt to veg/BioExt.
write some tests for bealign.
fix bealign gap extension penalty bug. It should be converted to a float.

alignment diagnosis task not running on cluster

Use homology modeling to generate envelope structure

So insertions relative to reference structure are visible in PV.

filter by length after shift correction

The poly-A filtering and alignment-to-reference steps may shorten a CCS so that it no longer passes the minimum length filter.

Detect this with a new length filter after shift correction, which will ensure the sequences are de-gapped.
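
A sketch of such a filter, assuming records are (name, sequence) pairs and that gaps are written as "-" or ".". The function names and the gap alphabet are assumptions.

```python
def degap(seq, gap_chars="-."):
    """Remove gap characters from an aligned sequence."""
    return "".join(c for c in seq if c not in gap_chars)


def filter_by_length(records, min_length):
    """Yield (name, degapped sequence) for records that are long enough.

    Length is measured after de-gapping, so gap columns introduced by
    alignment do not count toward the minimum.
    """
    for name, seq in records:
        clean = degap(seq)
        if len(clean) >= min_length:
            yield name, clean
```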

Use custom exception types

Result checks currently raise generic exceptions, such as when a result file is empty. Those should be caught and reported by the Result class.
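
A possible shape for the exception types, as a sketch. All class and function names here are hypothetical; the issue only says that the existing Result class should catch and report them.

```python
import os


class ResultError(Exception):
    """Base class for failed result checks (hypothetical name)."""


class MissingResultError(ResultError):
    """The expected result file does not exist."""


class EmptyResultError(ResultError):
    """The result file exists but has zero bytes."""


def check_result_file(path):
    """Raise a specific ResultError subclass if the file is missing or empty."""
    if not os.path.exists(path):
        raise MissingResultError(path)
    if os.path.getsize(path) == 0:
        raise EmptyResultError(path)
    return path
```

A caller can then use `except ResultError` to handle every failed check in one place while letting unrelated exceptions propagate.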

Ensure reference db does not contain stop codons

Benchmark each task

Record the start and stop time of each task. Use them to generate benchmarks and identify slow tasks.
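
A sketch of how start and stop times might be recorded, assuming each task body can be wrapped in a context manager. The names `timed` and `slowest` are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(task_name, records):
    """Append (task_name, start, stop) to records, even if the task raises."""
    start = time.time()
    try:
        yield
    finally:
        records.append((task_name, start, time.time()))


def slowest(records, n=3):
    """Return the n records with the longest duration, longest first."""
    return sorted(records, key=lambda r: r[2] - r[1], reverse=True)[:n]
```

The same records hold enough information for both a runtime summary and a timeline plot.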

Define Julia dependencies

Julia dependencies are not currently defined.

Current dependencies:

ArgParse
Dates
JSON
DataStructures

poly-A filtering step

After quality filtering, but before alignment.

wrong number of columns in rates.json

Running FUBAR on the PC76 data resulted in different-sized arrays for each timepoint in rates.json. Each should have the same number of columns, equal to the number of columns in the multiple sequence alignment.

usearch identity ignores terminal gaps

This causes short consensus sequences with long insertions or deletions to get overestimated copy numbers.
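
The effect can be illustrated with a small identity function. This is not usearch's implementation, only a sketch of the two ways of counting, assuming both sequences are already aligned to the same length with "-" as the gap character.

```python
def identity(a, b, count_terminal_gaps=True):
    """Fraction of matching columns between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = list(zip(a, b))
    if not count_terminal_gaps:
        # Drop leading and trailing columns where either sequence has a gap.
        while columns and "-" in columns[0]:
            columns.pop(0)
        while columns and "-" in columns[-1]:
            columns.pop()
    if not columns:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in columns)
    return matches / len(columns)
```

For "ACGTACGTAC" aligned against "ACGT------", ignoring terminal gaps gives an identity of 1.0, while counting them gives 0.4, so the short sequence stops looking like a perfect match.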

two-state HMM trims too much of the sequences

A long poly-A head can make the tail trimmer identify the whole sequence as tail, and vice-versa. Use a five-state HMM and Viterbi to implement trimming correctly.

write HMM
do not hardcode parameters
rewrite in Julia or Python
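
The issue does not list the five states, so the sketch below uses a smaller three-state left-to-right model (head, body, tail) purely to illustrate Viterbi decoding for trimming. The state set, the function name, and the default probabilities are all assumptions; the probabilities are passed as arguments rather than hardcoded.

```python
import math

NEG_INF = float("-inf")


def viterbi_body(seq, p_stay=0.9, p_a=0.9):
    """Return the bases assigned to the 'body' state on the best path.

    head and tail emit A with probability p_a; body emits each base
    uniformly. Transitions only move forward: head -> body -> tail.
    """
    if not seq:
        return ""
    states = ("head", "body", "tail")
    log = math.log
    start = {"head": log(0.5), "body": log(0.5), "tail": NEG_INF}
    trans = {
        "head": {"head": log(p_stay), "body": log(1 - p_stay)},
        "body": {"body": log(p_stay), "tail": log(1 - p_stay)},
        "tail": {"tail": 0.0},
    }

    def emit(state, base):
        if state == "body":
            return log(0.25)
        return log(p_a) if base == "A" else log((1 - p_a) / 3)

    # score[s] is the best log-probability of any path ending in state s.
    score = {s: start[s] + emit(s, seq[0]) for s in states}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev, best = None, NEG_INF
            for prev in states:
                if s in trans[prev]:
                    cand = score[prev] + trans[prev][s]
                    if cand > best:
                        best_prev, best = prev, cand
            new_score[s] = best + emit(s, base)
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state to recover the state path.
    state = max(states, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return "".join(b for b, s in zip(seq, path) if s == "body")
```

Because the model can only move forward through head, body, and tail, a long poly-A run at one end cannot absorb the whole read the way it can when head and tail are trimmed independently.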

Alignment-free cluster consensus

Currently we are doing a full MSA for each cluster, only to generate the consensus sequence.

usearch cannot handle long filenames

Add a different walltime for long-running tasks.

break alignment pipeline into smaller subpipelines

This will avoid errors like forgetting to pass length-filtered sequences downstream.

quality filtering pipeline
clustering pipeline
copynumber pipeline
msa pipeline

Better names for consensus sequences

Name them after their cluster number, not after a random sequence in the cluster.

More complete logs

Log the following information:

Update turnover script

Latest version takes gaps into account.

More informative diagnosis summaries

Currently the bad column summaries only show the top amino acid in each alignment, but these tend to be the same. Instead show the top three.
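
A sketch of the per-column summary, assuming the alignment is available as a list of equal-length amino acid strings. The function name is hypothetical.

```python
from collections import Counter


def top_residues(alignment, n=3):
    """For each column, return the n most common residues with their counts."""
    return [Counter(column).most_common(n) for column in zip(*alignment)]
```

Each column's entry is a list of (residue, count) pairs, so a column with fewer than three distinct residues simply yields a shorter list.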

run_info.json contains strings, not json objects

replace stop codons before running HyPhy scripts

Unable to copy errors on silverback

Could the qsub_files directory not exist when the first task of the pipeline finishes?

Frequencies need to be weighted by copy number
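
A minimal sketch of copy-number weighting, assuming the alignment is a mapping from sequence name to aligned sequence and copynumbers is a mapping from the same names to counts. Both layouts and the function name are assumptions.

```python
from collections import defaultdict


def weighted_frequencies(alignment, copynumbers):
    """Return one {residue: frequency} dict per alignment column.

    Each sequence contributes in proportion to its copynumber instead of
    counting once.
    """
    total = sum(copynumbers[name] for name in alignment)
    if total == 0:
        return []
    length = len(next(iter(alignment.values())))
    columns = [defaultdict(float) for _ in range(length)]
    for name, seq in alignment.items():
        for column, residue in zip(columns, seq):
            column[residue] += copynumbers[name] / total
    return [dict(c) for c in columns]
```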

mafft-fftns binary should be defined in config

Until we find a way to package the pipeline together with all third-party dependencies, every binary it uses should be defined in the config, for the sake of proper version control and explicit environment configuration.

Merge HyPhy and fasttree subpipelines

Summarize flow of sequences through the pipeline

For each step, show how many sequences were kept, and how many were discarded.
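
A sketch of the summary, assuming the number of sequences remaining after each step has already been counted. The function name and the input layout are hypothetical.

```python
def summarize_flow(counts):
    """Turn per-step sequence counts into (step, kept, discarded) rows.

    counts is an ordered list of (step name, sequences remaining after
    the step). Discarded is relative to the previous step; the first
    row has nothing before it, so its discarded count is 0.
    """
    rows = []
    previous = None
    for step, kept in counts:
        discarded = 0 if previous is None else previous - kept
        rows.append((step, kept, discarded))
        previous = kept
    return rows
```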

Compute copynumbers after editing alignment

Deleting HQCSs during alignment editing should be reflected in the copynumbers, but currently it is not because they are done separately. Need to refactor so we do copynumber counting with the results of the edited alignment.

HyPhy scripts fail if fasta record has a name or description

Provide alignment diagnosis to the web app

Package column-wise divergence and amino acid summaries into a json file.

Do not run tasks on the master node

Currently some tasks run locally. They are computationally light, but they should still be rewritten to run remotely.

The following can be done now, as the command-line scripts already exist:

The rest will have to be rewritten as scripts.

Integrate alignment diagnosis

resolve translation of codon-aligned CCSs
better metric: use chi-squared distance or earth mover's distance for each column's counts
improve the plots: distribution of distances
plot all time points combined: box plot for each column
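
For the metric item above, one common definition of the chi-squared distance between two columns can be sketched as follows. It assumes each column is summarized as a mapping from residue to count, and it compares normalized frequencies so that columns with different depths are comparable. The function name is hypothetical.

```python
def chi_squared_distance(counts_a, counts_b):
    """Chi-squared distance between two columns' residue counts.

    Uses 0.5 * sum((p - q)^2 / (p + q)) over normalized frequencies,
    which ranges from 0 (identical) to 1 (no shared residues).
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each column needs at least one observation")
    distance = 0.0
    for residue in set(counts_a) | set(counts_b):
        p = counts_a.get(residue, 0) / total_a
        q = counts_b.get(residue, 0) / total_b
        if p + q > 0:
            distance += (p - q) ** 2 / (p + q)
    return 0.5 * distance
```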

Make shift correction codon-aware

The shift correction algorithm may split and correct within codons. Until we come up with a good general-purpose algorithm, at least we can do shift correction in codon-length chunks.

Make fastq file specification more user-friendly

Currently assumes all input paths are absolute. Handle relative paths.
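
A sketch of relative path handling. The issue does not say what relative paths should be resolved against; the example assumes a base directory supplied by the caller, such as the directory of the file that lists the inputs. The function name is hypothetical.

```python
import os


def resolve_path(path, base_dir):
    """Return a normalized path, joining relative paths onto base_dir."""
    path = os.path.expanduser(path)
    if not os.path.isabs(path):
        path = os.path.join(base_dir, path)
    return os.path.normpath(path)
```

Absolute paths pass through unchanged apart from normalization, so existing input specifications keep working.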

test command line scripts

Shift correction improvements

print discarded sequences and the reason for the discard
codon grouping with lookahead/behind
use codon scoring matrix in out-of-frame regions.
identify stutters
possibly tolerate more errors in less conserved regions
penalize early stop codons
use base-wise quality scores.
use bealign to do real codon alignment. possibly modify bealign to generate in-frame references.

Prepare package for release

directory structure
setup.py
name all scripts in setup.py
make scripts executable, but do not rely on doctests, because nose ignores executables.
do not find scripts with file
update flea.config.tpl.
