flea-pipeline's People

Contributors

kemaleren, pditommaso, stevenweaver

Forkers

chudym2

flea-pipeline's Issues

Finalize data file formats.

  • The current nesting in files like sequences.json and trees.json assumes longitudinal sampling. Switch to a flat, general format.
  • Replace frequencies.json with coordinates.json.
  • Make a run overview json file.

obtainEvolutionaryHistory.bf segfaults

...if a timepoint has fewer than 3 consensus sequences.

In 6479e2e I added an exception if this condition occurs, to make the problem more obvious. Really, we should fix the script, though.

Upgrade dependencies in virtual environment.

  • Newest version of pysam fails to build in the FLEA virtual environment
  • flea-env's copy of bam2msa.py does not work. Temporarily using system version.
  • submit pull request for our version of BioExt to veg/BioExt.
  • write some tests for bealign.
  • fix bealign gap extension penalty bug. It should be converted to a float.

filter by length after shift correction

The poly-A filtering and reference alignment steps may shorten a CCS so that it no longer passes the minimum length filter.

Detect this with a new length filter after shift correction, at which point the sequences are guaranteed to be de-gapped.
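A minimal sketch of such a filter, assuming sequences are available as (name, sequence) pairs. The function names, the gap characters, and the threshold are hypothetical and are not taken from the pipeline. The filter strips gaps itself, so it does not depend on the upstream step having removed them.

```python
def degap(seq, gap_chars="-."):
    """Return seq with all gap characters removed."""
    return "".join(ch for ch in seq if ch not in gap_chars)


def filter_by_length(records, min_length):
    """Yield (name, degapped sequence) for records that are long enough.

    records: iterable of (name, sequence) pairs (assumed layout).
    min_length: minimum number of non-gap characters to keep a record.
    """
    for name, seq in records:
        clean = degap(seq)
        if len(clean) >= min_length:
            yield name, clean


if __name__ == "__main__":
    reads = [("ccs1", "ACG-TAC"), ("ccs2", "A--C")]
    print(list(filter_by_length(reads, min_length=4)))
```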

Use custom exception types

Result checks currently raise generic exceptions, such as when a result file is empty. Those should be caught and reported by the Result class.
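One possible shape for this, sketched under assumptions: the class names and check_result_file are hypothetical, and the real Result class is not shown here. A shared base class lets a caller catch every result-check failure with a single except clause while still telling the cases apart.

```python
import os


class ResultError(Exception):
    """Base class for result-check failures (hypothetical name)."""

    def __init__(self, path, message):
        super().__init__("{}: {}".format(message, path))
        self.path = path


class MissingResultError(ResultError):
    def __init__(self, path):
        super().__init__(path, "result file does not exist")


class EmptyResultError(ResultError):
    def __init__(self, path):
        super().__init__(path, "result file is empty")


def check_result_file(path):
    """Raise a specific ResultError subclass if the file is missing or empty."""
    if not os.path.exists(path):
        raise MissingResultError(path)
    if os.path.getsize(path) == 0:
        raise EmptyResultError(path)
    return path
```

A caller could then wrap the check in `except ResultError as err:` and report `err.path` without also swallowing unrelated exceptions.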

Benchmark each task

Record the start and stop time of each task. Use them to generate benchmarks and identify slow tasks.
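A small sketch of how the timing could be recorded, assuming each task body can be wrapped in a context manager. The names timed_task and slowest are illustrative only and do not come from the pipeline.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_task(name, log):
    """Append a timing record for the wrapped block to log (a list)."""
    start_wall = time.time()          # wall-clock start, for reporting
    start_perf = time.perf_counter()  # monotonic clock, for the duration
    try:
        yield
    finally:
        log.append({
            "task": name,
            "start": start_wall,
            "stop": time.time(),
            "seconds": time.perf_counter() - start_perf,
        })


def slowest(log, n=3):
    """Return the n records with the longest duration."""
    return sorted(log, key=lambda rec: rec["seconds"], reverse=True)[:n]


if __name__ == "__main__":
    records = []
    with timed_task("example", records):
        sum(range(100000))
    print(slowest(records, n=1))
```

The record is written in a finally clause, so a task that raises still gets a timing entry.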

Define Julia dependencies

Julia dependencies are not currently defined.

Current dependencies:

  • ArgParse
  • Dates
  • JSON
  • DataStructures

wrong number of columns in rates.json

Running FUBAR on the PC76 data resulted in different-sized arrays for each timepoint in rates.json. Each should have the same number of columns, equal to the number of columns in the multiple sequence alignment.

two-state HMM trims too much of the sequences

A long poly-A head can make the tail trimmer identify the whole sequence as tail, and vice versa. Use a five-state HMM and Viterbi decoding to implement trimming correctly.

  • write HMM
  • do not hardcode parameters
  • rewrite in Julia or Python
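The issue does not spell out the five states, so the sketch below is a deliberately simplified three-state version (head, body, tail) that only illustrates the idea: a left-to-right HMM decoded with Viterbi labels head and tail jointly, so a long head cannot be mistaken for tail. All probabilities are illustrative assumptions, not tuned values, and they are passed as arguments rather than hardcoded inside the decoder.

```python
import math

STATES = ("head", "body", "tail")
NEG_INF = float("-inf")

# Illustrative parameters only (assumptions, not values from the pipeline).
START = {"head": 0.5, "body": 0.5, "tail": 0.0}
TRANS = {
    "head": {"head": 0.9, "body": 0.1, "tail": 0.0},
    "body": {"head": 0.0, "body": 0.98, "tail": 0.02},
    "tail": {"head": 0.0, "body": 0.0, "tail": 1.0},
}


def emission(state, base):
    """Head and tail strongly prefer 'A'; the body is uniform over bases."""
    if state == "body":
        return 0.25
    return 0.9 if base == "A" else 0.1 / 3


def _log(p):
    return math.log(p) if p > 0 else NEG_INF


def viterbi(seq, start=START, trans=TRANS, emit=emission):
    """Return the most likely state path for seq, in log space."""
    if not seq:
        return []
    scores = {s: _log(start[s]) + _log(emit(s, seq[0])) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: scores[p] + _log(trans[p][s]))
            new_scores[s] = scores[best] + _log(trans[best][s]) + _log(emit(s, base))
            pointers[s] = best
        scores = new_scores
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def trim(seq):
    """Keep only the bases that Viterbi labels as body."""
    return "".join(b for b, s in zip(seq, viterbi(seq)) if s == "body")
```

Because the transitions only move forward (head to body to tail), an all-A read is labelled entirely as head and trims to an empty string, which a later length filter can discard.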

More complete logs

Log the following information:

  • infiles
  • stdout
  • stderr
  • external commands
  • whether the task failed
  • exceptions and tracebacks
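A sketch of a wrapper that logs the items above for one external command, using only the standard logging and subprocess modules. The name run_logged and the message layout are assumptions made for illustration.

```python
import logging
import subprocess

logger = logging.getLogger("flea.tasks")  # hypothetical logger name


def run_logged(cmd, infiles=(), log=logger):
    """Run cmd (a list of arguments) and log inputs, output, and outcome."""
    log.info("infiles: %s", list(infiles))
    log.info("command: %s", " ".join(cmd))
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except Exception:
        # logger.exception records the message together with the traceback.
        log.exception("command could not be started")
        raise
    log.info("stdout: %s", proc.stdout)
    log.info("stderr: %s", proc.stderr)
    log.info("failed: %s", proc.returncode != 0)
    return proc
```

The wrapper returns the completed process rather than raising on a non-zero exit code, so the caller decides how a failed task is handled.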

More informative diagnosis summaries

Currently the bad column summaries only show the top amino acid in each alignment, but these tend to be the same. Instead show the top three.
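Counting the top three residues in a column is a small change with collections.Counter. The sketch assumes an alignment is a list of equal-length strings; top_residues is a hypothetical helper name.

```python
from collections import Counter


def top_residues(alignment, column, n=3):
    """Return the n most common (residue, count) pairs in one column.

    alignment: list of equal-length aligned sequences (assumed layout).
    """
    counts = Counter(seq[column] for seq in alignment)
    return counts.most_common(n)


if __name__ == "__main__":
    aln = ["MKV", "MRV", "MKI", "MQV"]
    print(top_residues(aln, 1))
```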

mafft-fftns binary should be defined in config

Until we find a way to completely package the pipeline to include all third party dependencies, we will need to define all binaries used in the config for the sake of proper version control and explicit environment configuration.
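A sketch of reading a binary from the config and failing early if it cannot be found. It assumes an INI-style config; the section name "Paths" and the key "mafft" are hypothetical and may not match flea.config.tpl.

```python
import configparser
import shutil


def resolve_binary(config, key, section="Paths"):
    """Look up a binary in the config and return its resolved path.

    The config value may be a bare command name or a full path.
    """
    value = config.get(section, key)
    path = shutil.which(value)
    if path is None:
        raise FileNotFoundError(
            "binary '{}' for [{}] {} not found".format(value, section, key)
        )
    return path


if __name__ == "__main__":
    cfg = configparser.ConfigParser()
    cfg.read_string("[Paths]\nmafft = mafft-fftns\n")
    try:
        print(resolve_binary(cfg, "mafft"))
    except FileNotFoundError as err:
        print(err)
```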

Merge HyPhy and fasttree subpipelines

  • generate frequencies.json and turnover.json
  • change file locations for evo history script
  • add evo history script to pipeline
  • change file locations for fubar script
  • Add FUBAR script to generate rates.json
  • rename subpipeline
  • refactor main pipeline file
  • remove old subpipeline
  • add subdirectories for each subpipeline
  • ensure new pipeline works for short amplicons, or at least have an option to disable parts that do not.
  • update config file
  • copynumbers in evo history script

Compute copynumbers after editing alignment

Deleting HQCSs during alignment editing should be reflected in the copynumbers, but currently it is not, because the two steps run separately. Refactor so that copynumber counting uses the edited alignment.

Do not run tasks on the master node

Currently some tasks run locally. They are computationally light, but they should still be rewritten to run remotely.

The following can be done now, as the command-line scripts already exist:

  • trim heads/tails
  • shift correction
  • translate
  • backtranslate
  • consensus

The rest will have to be rewritten as scripts.

Integrate alignment diagnosis

  • resolve translation of codon-aligned CCSs
  • better metric: use chi-squared distance or earth mover's distance for each column's counts
  • improve the plots: distribution of distances
  • plot all time points combined: box plot for each column
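As an illustration of the chi-squared option, the sketch below compares the residue counts of one column in two alignments. It uses one common symmetric definition, 0.5 * sum((p - q)^2 / (p + q)) over normalized frequencies, which is an assumption rather than the metric the pipeline settled on. The function name and the dict-of-counts input are also hypothetical.

```python
def chi_squared_distance(counts_a, counts_b):
    """Symmetric chi-squared distance between two count dictionaries.

    Counts are normalized to frequencies first, so the result lies in
    [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both columns need at least one count")
    distance = 0.0
    for residue in set(counts_a) | set(counts_b):
        p = counts_a.get(residue, 0) / total_a
        q = counts_b.get(residue, 0) / total_b
        if p + q > 0:
            distance += (p - q) ** 2 / (p + q)
    return 0.5 * distance


if __name__ == "__main__":
    print(chi_squared_distance({"K": 8, "R": 2}, {"K": 5, "R": 5}))
```

Computing this per column gives one value per alignment position, which is the kind of quantity the distribution and box plots in the list above would summarize.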

Make shift correction codon-aware

The shift correction algorithm may split and correct within codons. Until we come up with a good general-purpose algorithm, at least we can do shift correction in codon-length chunks.

Shift correction improvements

  • print discarded sequences and the reason for the discard
  • codon grouping with lookahead/behind
  • use codon scoring matrix in out-of-frame regions.
  • identify stutters
  • possibly tolerate more errors in less conserved regions
  • penalize early stop codons
  • use base-wise quality scores.
  • use bealign to do real codon alignment; possibly modify bealign to generate in-frame references.

Prepare package for release

  • directory structure
  • setup.py
  • name all scripts in setup.py
  • make scripts executable, but do not rely on doctests, because nose ignores executables.
  • do not find scripts with file
  • update flea.config.tpl.

MRCA may not be in-frame

This should not happen, because the nucleotide alignment from which it is generated is a backtranslated amino-acid alignment. Possibly the backtranslation method is wrong.

MRCA looks wrong in UMASS short dataset

The majority of sequences in the earliest timepoint have a 7 bp insertion after HXB2 coordinate 132. For some reason this insertion does not appear in the MRCA.
