veg / flea-pipeline
A pipeline for long-read sequencing data.
License: Other
sequences.json and trees.json assume longitudinal sampling. Switch to a flat, general format.
frequencies.json with coordinates.json.
BioPython doesn't like some codons with X in them. For instance, translating "XAA" fails but "NAA" gives "X".
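One possible workaround is a lenient translation that maps any codon it does not recognise to "X". The sketch below is pure Python and does not call BioPython; `translate_lenient` is a hypothetical helper, not part of the pipeline. As a simplification, it returns "X" for every ambiguous codon, even when all of its expansions encode the same amino acid.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_lenient(seq):
    """Translate a nucleotide string, emitting "X" for unrecognised codons."""
    seq = seq.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

With this helper, "XAA" and "NAA" are treated the same way, so a stray X no longer aborts translation.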
...if a timepoint has fewer than 3 consensus sequences.
In 6479e2e I added an exception if this condition occurs, to make the problem more obvious. Really, we should fix the script, though.
Their input files are not correct
...but now we use a separate file for copynumbers.
If the user provides an alignment, the pipeline will fail. They need to also provide a copynumber file. Also have the option to assume a copynumber of 1 for each sequence.
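The fallback described above could look roughly like this. It is a sketch only: `resolve_copynumbers` and its `assume_one` flag are hypothetical names, and the copynumbers are assumed to have already been parsed into a dict keyed by sequence id.

```python
def resolve_copynumbers(seq_ids, copynumbers=None, assume_one=False):
    """Return {sequence id: copynumber} for every sequence in the alignment.

    Ids missing from `copynumbers` raise an error unless `assume_one` is set,
    in which case they get a copynumber of 1.
    """
    known = dict(copynumbers or {})
    missing = [sid for sid in seq_ids if sid not in known]
    if missing and not assume_one:
        raise ValueError("no copynumber for: " + ", ".join(missing))
    for sid in missing:
        known[sid] = 1
    return {sid: known[sid] for sid in seq_ids}
```

Failing by default keeps a forgotten copynumber file visible, while the flag covers the case where the user only has an alignment.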
Even though it has three dependencies, sometimes compute_copynumber does not wait for them to complete.
So insertions relative to reference structure are visible in PV.
Poly-A filtering and alignment-to-reference steps may shorten a CCS so that it no longer passes the minimum length filter.
Detect this with a new length filter after shift correction, which will ensure the sequences are de-gapped.
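A minimal sketch of such a filter, assuming sequences arrive as (name, sequence) pairs and gaps are written as "-". The function name and the `min_length` parameter are illustrative and are not taken from the pipeline.

```python
def degap_and_filter(records, min_length):
    """Remove gap characters, then split records by ungapped length.

    Returns (kept, dropped); both lists hold (name, ungapped sequence) pairs.
    """
    kept, dropped = [], []
    for name, seq in records:
        ungapped = seq.replace("-", "")
        if len(ungapped) >= min_length:
            kept.append((name, ungapped))
        else:
            dropped.append((name, ungapped))
    return kept, dropped
```

Because the filter itself removes the gaps, everything it passes downstream is guaranteed to be de-gapped.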
Result checks currently raise generic exceptions, such as when a result file is empty. Those should be caught and reported by the Result class.
Record the start and stop time of each task. Use them to generate benchmarks and identify slow tasks.
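One lightweight way to collect these timings is a decorator around each task function. This is a sketch under the assumption that tasks are plain Python callables; `timed`, `TIMINGS`, and `slowest` are hypothetical names.

```python
import functools
import time

TIMINGS = []  # one record per completed task call


def timed(func):
    """Record wall-clock start and stop times for each call of `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if the task raises, so failures are timed too.
            TIMINGS.append(
                {"task": func.__name__, "start": start, "stop": time.time()}
            )
    return wrapper


def slowest(n=5):
    """Return the n longest-running records, slowest first."""
    by_duration = sorted(
        TIMINGS, key=lambda t: t["stop"] - t["start"], reverse=True
    )
    return by_duration[:n]
```

Tasks that run remotely would need the timestamps taken on the worker instead, which this sketch does not cover.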
Julia dependencies are not currently defined.
Current dependencies:
After quality filtering, but before alignment.
Running FUBAR on the PC76 data resulted in different-sized arrays for each timepoint in rates.json. Each should have the same number of columns, equal to the number of columns in the multiple sequence alignment.
A long poly-A head can make the tail trimmer identify the whole sequence as tail, and vice-versa. Use a five-state HMM and Viterbi to implement trimming correctly.
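The issue does not say which five states are meant. The sketch below assumes a left-to-right model with a poly-A head, a poly-T head, a body, a poly-A tail, and a poly-T tail. All probabilities are illustrative guesses rather than fitted values, and the function names are hypothetical.

```python
import math

STATES = ("head_A", "head_T", "body", "tail_A", "tail_T")
NEG_INF = float("-inf")

# Assumed parameters: a read may start in either head state or in the body.
START = {"head_A": 1 / 3, "head_T": 1 / 3, "body": 1 / 3}
TRANS = {
    "head_A": {"head_A": 0.9, "body": 0.1},
    "head_T": {"head_T": 0.9, "body": 0.1},
    "body": {"body": 0.98, "tail_A": 0.01, "tail_T": 0.01},
    "tail_A": {"tail_A": 1.0},
    "tail_T": {"tail_T": 1.0},
}


def _log(p):
    return math.log(p) if p > 0 else NEG_INF


def _emission(state, base):
    if state == "body":
        return 0.25  # uniform over A, C, G, T
    return 0.9 if base == state[-1] else 0.1 / 3


def viterbi(seq):
    """Return the most likely state path for `seq` under the model above."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: _log(START.get(s, 0.0)) + _log(_emission(s, seq[0])) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev, best = None, NEG_INF
            for p in STATES:
                cand = score[p] + _log(TRANS[p].get(s, 0.0))
                if cand > best:
                    best_prev, best = p, cand
            new_score[s] = best + _log(_emission(s, base))
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def trim_poly_ends(seq):
    """Keep only the bases that the Viterbi path assigns to the body state."""
    return "".join(b for b, s in zip(seq, viterbi(seq)) if s == "body")
```

Because head and tail are decoded jointly, a long head can no longer be absorbed by the tail state. A read that is entirely poly-A comes back empty and can then be dropped by a length filter.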
Currently we are doing a full MSA for each cluster, only to generate the consensus sequence.
This will avoid errors like forgetting to pass length-filtered sequences downstream.
Name them after their cluster number, not after a random sequence in the cluster.
Log the following information:
Latest version takes gaps into account.
Currently the bad column summaries only show the top amino acid in each alignment, but these tend to be the same. Instead show the top three.
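A minimal sketch of the top-three summary, assuming the alignment is available as a list of equal-length amino acid strings; `top_residues` is a hypothetical name.

```python
from collections import Counter


def top_residues(alignment, column, n=3):
    """Return the n most common residues in one column as (residue, count) pairs."""
    counts = Counter(seq[column] for seq in alignment)
    return counts.most_common(n)
```

Reporting the counts alongside the residues makes it easier to see when two alignments share a top amino acid but differ in the minor variants.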
Could the qsub_files directory not exist when the first task of the pipeline finishes?
Until we find a way to completely package the pipeline to include all third-party dependencies, we will need to define all binaries used in the config for the sake of proper version control and explicit environment configuration.
frequencies.json and turnover.json
rates.json
For each step, show how many sequences were kept, and how many were discarded.
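Such a report could be produced from per-step counts, for example as below. The input layout, a list of (step name, sequences in, sequences kept) tuples, is an assumption made for illustration.

```python
def summarize_steps(counts):
    """Format one line per step showing how many sequences were kept and discarded."""
    lines = []
    for name, n_in, n_kept in counts:
        lines.append(
            "{}: kept {}, discarded {}".format(name, n_kept, n_in - n_kept)
        )
    return "\n".join(lines)
```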
Deleting HQCSs during alignment editing should be reflected in the copynumbers, but currently it is not because they are done separately. Need to refactor so we do copynumber counting with the results of the edited alignment.
Package column-wise divergence and amino acid summaries into a json file.
Currently some tasks run locally. They are computationally light, but they should still be rewritten to run remotely.
The following can be done now, as the command-line scripts already exist:
The rest will have to be rewritten as scripts.
The shift correction algorithm may split and correct within codons. Until we come up with a good general-purpose algorithm, at least we can do shift correction in codon-length chunks.
Currently assumes all input paths are absolute. Handle relative paths.
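A small sketch of how relative paths might be normalized before they reach the tasks. `resolve_path` is a hypothetical helper, and resolving against the current working directory by default is an assumption; resolving against the config file's directory may be the better choice.

```python
import os


def resolve_path(path, base=None):
    """Return an absolute, normalized version of `path`.

    Relative paths are resolved against `base`, or against the current
    working directory when `base` is not given.
    """
    path = os.path.expanduser(path)
    if os.path.isabs(path):
        return os.path.normpath(path)
    base = os.getcwd() if base is None else base
    return os.path.normpath(os.path.join(os.path.abspath(base), path))
```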
setup.py
nose ignores executables.
If the entire sequence is poly-A or poly-T, it should get dropped.
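The poly-A/poly-T check can be done before any trimming. A minimal sketch, with a hypothetical function name and a strict definition in which every base must be the same:

```python
def is_poly_a_or_t(seq):
    """Return True if `seq` is non-empty and consists only of A's or only of T's."""
    bases = set(seq.upper())
    return bases == {"A"} or bases == {"T"}
```

A sequence for which this returns True would be dropped outright. Reads with sequencing errors inside the homopolymer would need a more tolerant test than this one.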
This should not happen, because the nucleotide alignment from which it is generated is a backtranslated amino-acid alignment. Possibly the backtranslation method is wrong.
The majority of sequences in the earliest timepoint have a 7 bp insertion after HXB2 coordinate 132. For some reason this insertion does not appear in the MRCA.