brendelgroup / aegean
Integrated toolkit for analysis and evaluation of annotated genomes
Home Page: http://brendelgroup.github.io/AEGeAn
License: ISC License
It's long and messy: it could benefit from some work.
The entire AgnGeneValidator class, and especially the mRNA validation function, is bulky and needs refactoring for better code organization.
The agn_calc_edit_distance function is called by the function(s) that calculate splice complexity. The agn_calc_edit_distance function ends up doing a full comparative analysis, while only the edit distance is needed. Would it provide any substantial benefit to calculate edit distance between two transcripts independent of a full comparative assessment?
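One way to explore this question is to compare the two transcript structures position by position, without building the rest of the comparison statistics. The following is only a minimal sketch: it assumes each transcript has already been encoded as a string with one structure label per nucleotide across the locus, and the function name model_vector_distance is hypothetical, not part of the AEGeAn API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the AEGeAn API.
 * Both strings cover the same locus with one label per nucleotide
 * (for example 'C' = CDS, 'U' = UTR, 'I' = intron, 'G' = intergenic).
 * Returns the fraction of positions whose labels differ. */
double model_vector_distance(const char *a, const char *b)
{
  size_t len = strlen(a);
  assert(len == strlen(b)); /* both vectors must span the same locus */
  if (len == 0)
    return 0.0;

  size_t mismatches = 0;
  for (size_t i = 0; i < len; i++)
    if (a[i] != b[i])
      mismatches++;

  return (double)mismatches / (double)len;
}
```

Whether this matches the value produced by the full comparative analysis depends on how that analysis defines edit distance, which would need to be checked against the existing code.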
A recent commit refactored the agn_gene_locus_get_clique_pairs and agn_gene_locus_find_best_pairs methods (now agn_gene_locus_enumerate_clique_pairs and agn_gene_locus_comparative_evaluation, respectively). However, for semantics' sake it may still be worth separating the methods that retrieve the return values (number of clique pairs and the best pairs for reporting) from the methods that calculate and store those values. Perhaps those should be private?
man ar
libaegean.a library
-laegean flag
BitBucket was originally used for hosting AEGeAn, and some issues are still being tracked there. These need to be addressed and/or ported over to GitHub.
There are several places in the core code where a loop is terminated prematurely if errors are detected in the logger. On one hand, the sooner the program terminates after detection of an error, the more likely it is that the error message(s) will be useful for addressing problems with the data. On the other hand, if the loop is allowed to continue and finish parsing all data, then all problems can be reported at once (so they can be fixed all at once rather than one at a time).
I should consider whether to allow these loops to run to completion even after errors have been detected. Preliminary testing suggests this shouldn't cause many problems.
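The run-to-completion alternative can be sketched as follows. This is an illustration under stated assumptions rather than AEGeAn code: the Record type, the validity check, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type and validity check, for illustration only. */
typedef struct { long start; long end; } Record;

static int record_is_valid(const Record *r)
{
  return r->start >= 1 && r->start <= r->end;
}

/* Visits every record, even after a failure. Stores the index of each
 * invalid record in bad (up to max_bad entries) and returns the total
 * number of invalid records found. */
size_t validate_all(const Record *recs, size_t n, size_t *bad, size_t max_bad)
{
  assert(recs != NULL || n == 0);
  size_t nbad = 0;
  for (size_t i = 0; i < n; i++)
  {
    if (!record_is_valid(&recs[i]))
    {
      if (nbad < max_bad)
        bad[nbad] = i;
      nbad++;
      /* no break here: continue so that all problems are reported at once */
    }
  }
  return nbad;
}
```

The caller can then decide to abort after the loop if the returned count is nonzero, which keeps the "fail on error" behavior while still reporting every problem.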
In an effort to completely decouple AEGeAn from any non-API GenomeTools code, the code to reset a feature node's source attribute has been removed. I need to work with the GenomeTools developers to discuss potential solutions.
Some functions and data structures were previously designated as "ParsEval-specific" and the code was refactored accordingly so that they did not clutter the core code and API. Now that I am writing the geneannology diff command, however, I see how much of this really should be core code. This deserves another look-over to see how much ParsEval-specific code there really should be.
This project is implemented entirely in C, although it is listed as a JavaScript project because the bundled jQuery and Mootools source code outweighs my own C code. I need to rename the JavaScript files so as to be recognized as frameworks (see https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml).
...these need to be fixed now that the AgnLogger class is available.
I'm currently using the AgnError class to store warning messages and/or non-fatal error messages. This distinction is silly, but I like the functionality this class provides. Perhaps I should revamp the class altogether, rename it AgnLogger, and enable it to keep track of error messages (always fatal), warning messages (never fatal), and status messages.
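A logger along the lines described above might look like the following minimal sketch. The type and function names (Logger, logger_log, and so on) and the fixed-size storage are assumptions made for illustration; they are not the AgnError or AgnLogger interface.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sketch: three message levels, only errors are fatal. */
typedef enum { LOG_STATUS, LOG_WARNING, LOG_ERROR } LogLevel;

#define LOG_MAX_MSGS 64
#define LOG_MSG_LEN 256

typedef struct
{
  LogLevel levels[LOG_MAX_MSGS];
  char msgs[LOG_MAX_MSGS][LOG_MSG_LEN];
  unsigned count;
} Logger;

void logger_init(Logger *logger) { logger->count = 0; }

/* Stores a printf-style message with the given level. */
void logger_log(Logger *logger, LogLevel level, const char *fmt, ...)
{
  assert(logger != NULL && fmt != NULL);
  if (logger->count == LOG_MAX_MSGS)
    return; /* sketch only: messages beyond capacity are dropped */
  va_list ap;
  va_start(ap, fmt);
  vsnprintf(logger->msgs[logger->count], LOG_MSG_LEN, fmt, ap);
  va_end(ap);
  logger->levels[logger->count++] = level;
}

/* True if at least one (fatal) error message has been stored. */
bool logger_has_error(const Logger *logger)
{
  for (unsigned i = 0; i < logger->count; i++)
    if (logger->levels[i] == LOG_ERROR)
      return true;
  return false;
}

/* Prints all stored messages of one level; returns how many were printed. */
unsigned logger_print(const Logger *logger, LogLevel level, FILE *out)
{
  unsigned printed = 0;
  for (unsigned i = 0; i < logger->count; i++)
  {
    if (logger->levels[i] != level)
      continue;
    fprintf(out, "%s\n", logger->msgs[i]);
    printed++;
  }
  return printed;
}
```

Keeping the level next to each message means callers only need logger_has_error to decide whether to stop, while warnings and status messages can be printed or ignored independently.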
...was designed to print GFF3, but it needs to be reviewed.
The AgnLocusIndex object should keep track of ranges for sequences that it parses.
union: add new annotations, keep all
intersect: add new annotations
    -r|--replace            delete existing annotations that overlap with new annotations
    -c|--complement         do not add new annotations that overlap with existing annotations
clean: start all over again
update: modify existing annotations
    -n|--new: STRING        add transcript as new isoform, provide ID
    -r|--replace: STRING    provide ID of transcript to replace
    -c|--clean              start fresh at locus
delete: delete
The issue title is pretty self-explanatory. From the code, it looks like it will not infer UTR-only exons.
Currently, ParsEval looks for shared sequences and reports only loci on shared sequences. Instead, it should report all sequences and all loci.
Since changing to the AgnLocusIndex code in ParsEval, the table of loci on the sequence page displays incorrect numbers for the number of reference genes, prediction genes, and comparisons per locus.
I should consider implementing a function in the AgnLocusIndex class that would remove loci based on a provided set of filters. This would be much easier than inserting filter checks into subsequent analysis.
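The idea of filtering loci once, up front, could be sketched like this. The Locus and LocusFilter types and the two filter criteria are hypothetical stand-ins; the real AgnLocusIndex stores loci differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a locus and a set of filter criteria. */
typedef struct { long start; long end; unsigned num_genes; } Locus;
typedef struct { long min_length; unsigned max_genes; } LocusFilter;

static int locus_passes(const Locus *locus, const LocusFilter *filter)
{
  long length = locus->end - locus->start + 1;
  return length >= filter->min_length && locus->num_genes <= filter->max_genes;
}

/* Removes loci that fail the filter by compacting the array in place.
 * Order is preserved; returns the number of loci retained. */
size_t filter_loci(Locus *loci, size_t n, const LocusFilter *filter)
{
  assert(filter != NULL);
  size_t kept = 0;
  for (size_t i = 0; i < n; i++)
    if (locus_passes(&loci[i], filter))
      loci[kept++] = loci[i];
  return kept;
}
```

With a single filtering pass like this, downstream analysis code can assume every locus it sees has already passed the filters.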
If AEGeAn, and ParsEval in particular, is going to be made available as a shared library, then the code needs better organization. Ideally, the parseval.c program would be pretty minimal, making calls to a small number of functions that another developer could use. These functions should utilize the AgnError class rather than the GtError class, although of course they may need to use the GtError class internally.
The agn_clique_pair_add_to_vector function was originally included in the AgnCliquePair class, IIRC, since the locus range is needed to add the transcript's features to the model vector. However, this could (and probably should) be generalized and moved to a more general module/class.
There is some Cairo-related graphics code in the AgnPairwiseCompareLocus class. I should consider moving this to the PeReports module, as graphics for other AEGeAn tools aren't on the roadmap any time in the near future.
...it seems like an easy fix.
Documentation is missing for several methods/functions. Many of these have been marked with FIXME, but some may not have been.
AEGeAn has not been analyzed for memory leaks in quite a while, and I suspect there are several. It is time.
The code for parsing loci should be accessible by different API functions, with definitions something like these.
/**
 * Given a pair of annotation feature sets in memory, identify loci while
 * keeping the two sources of annotation separate (to enable comparison).
 *
 * @param[in]  refrfeats    reference annotations
 * @param[in]  predfeats    prediction annotations
 * @param[out] error        error object to which warning/error messages will be
 *                          written if necessary
 * @returns                 a GtArray containing pointers to GtArrays A_1, A_2,
 *                          ..., A_n (if the reference and prediction share
 *                          n annotated sequences), each A_i containing pointers
 *                          to AgnPairwiseCompareLocus objects
 */
GtArray *agn_parse_pairwise_loci_memory(GtFeatureIndex *refrfeats,
                                        GtFeatureIndex *predfeats,
                                        AgnError *error);
/**
 * Given a pair of annotation feature sets in GFF3 files, identify loci while
 * keeping the two sources of annotation separate (to enable comparison).
 *
 * @param[in]  refrfile     path to GFF3 file containing reference annotations
 * @param[in]  predfile     path to GFF3 file containing prediction annotations
 * @param[out] error        error object to which warning/error messages will be
 *                          written if necessary
 * @returns                 a GtArray containing pointers to GtArrays A_1, A_2,
 *                          ..., A_n (if the reference and prediction share
 *                          n annotated sequences), each A_i containing pointers
 *                          to AgnPairwiseCompareLocus objects
 */
GtArray *agn_parse_pairwise_loci_disk(const char *refrfile,
                                      const char *predfile,
                                      AgnError *error);
/**
 * Identify loci given an index of annotation features.
 *
 * @param[in]  features     index containing valid annotation features (FIXME)
 * @param[out] error        error object to which warning/error messages will be
 *                          written if necessary
 * @returns                 a GtArray containing pointers to GtArrays A_1, A_2,
 *                          ..., A_n (if the index contains n annotated
 *                          sequences), each A_i containing pointers
 *                          to AgnLocus objects
 */
GtArray *agn_parse_loci_memory(GtFeatureIndex *features, AgnError *error);
/**
 * Identify loci given a list of annotation files.
 *
 * @param[in]  numfiles     the number of annotation files
 * @param[in]  filenames    list of filenames corresponding to GFF3 files
 * @param[out] error        error object to which warning/error messages will be
 *                          written if necessary
 * @returns                 a GtArray containing pointers to GtArrays A_1, A_2,
 *                          ..., A_n (if the files contain n annotated
 *                          sequences), each A_i containing pointers
 *                          to AgnLocus objects
 */
GtArray *agn_parse_loci_disk(int numfiles, const char **filenames,
                             AgnError *error);
This functionality is now trivial with LocusPocus.
It should be fairly simple to replace the ga_get_loci_from_feature_index function with the functionality of the AgnLocusIndex class: this is precisely what the class was designed for.
The AgnLocusIndex class maintains a hashmap of interval trees, each of which holds gene locus objects. For memory management, it would be much easier to add agn_gene_locus_delete as the interval tree's free function. However, the consequence is that all gene loci would remain in memory until the locus index is freed. Currently, ParsEval deletes loci as it analyzes them to reduce the memory footprint. I'll have to brainstorm a good solution to this.
In the HTML output mode, these seem to be having an issue. Grabbing the genes directly and printing out their IDs seems to work fine though.
I was using ParsEval to compare some new Maker annotations to some older annotations, and I found a graphics issue that I think is related to intron-containing UTRs. Most UTRs reported by Maker have no ID attribute, but those that contain introns require multiple lines in the GFF3 file, and therefore Maker uses the ID attribute to indicate which entries/lines correspond to the same UTR.
I'm guessing this is the cause of some strange behavior I'm seeing with the GenomeTools/AnnotationSketch-generated graphics. Here is a ParsEval report whose input included intron-containing UTRs as reported by Maker...
...and here is the same report for the same input, minus the UTR features (so that they were inferred by ParsEval).
I need to come up with a few more use cases to confirm that this is the cause of the behavior, then bring this up with the GenomeTools developers and determine whether it's a bug or a feature.
ParsEval has an -a option for copying shared data from a non-standard location; this needs to be extended to filter files.
Several functional tests have been integrated, but AEGeAn needs some unit tests. All of this could eventually be connected to Travis CI for automated testing.
It might be nice to implement a filter for the function(s) that write GFF3 data to output, specifying feature types to be skipped. This could then easily be integrated with the CanonGFF3 program.
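Such a filter could be as simple as checking each feature's type against a skip list before writing it. The sketch below assumes a simplified record holding a type string and a preformatted GFF3 line; Gff3Record, type_is_skipped, and write_filtered are hypothetical names, not AEGeAn or GenomeTools functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical simplified record: feature type plus its GFF3 line. */
typedef struct { const char *type; const char *line; } Gff3Record;

/* Returns true if type appears in the NULL-terminated skip list. */
bool type_is_skipped(const char *type, const char *const *skip)
{
  assert(type != NULL && skip != NULL);
  for (size_t i = 0; skip[i] != NULL; i++)
    if (strcmp(type, skip[i]) == 0)
      return true;
  return false;
}

/* Writes every record whose type is not in the skip list;
 * returns the number of lines written. */
size_t write_filtered(FILE *out, const Gff3Record *recs, size_t n,
                      const char *const *skip)
{
  size_t written = 0;
  for (size_t i = 0; i < n; i++)
  {
    if (type_is_skipped(recs[i].type, skip))
      continue;
    fprintf(out, "%s\n", recs[i].line);
    written++;
  }
  return written;
}
```

A program such as CanonGFF3 could build the skip list from a command-line option and pass it through unchanged.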
The AgnLocusIndex class uses the agn_parse_canonical function, which does gene validation, to parse loci. This is (perhaps?) appropriate for parsing loci intended for pairwise comparison, but the simple locus parsing methods were written so as to handle any data types correctly, so validation is not required. It might be worth looking into whether this can be improved.
In several places, both in stable code (pushed onto GitHub) and unstable code (locally), there are feature node visitor functions that have something of the following form.
if(gt_feature_node_is_pseudo(fn))
{
  GtArray *features = gt_array_new( sizeof(GtFeatureNode *) );
  agn_gt_feature_node_resolve_pseudo_node(fn, features);
  while(gt_array_size(features) > 0)
  {
    // CODEBLOCK 1: process child of pseudo node
  }
  gt_array_delete(features);
}
else
{
  // CODEBLOCK 2: process normal feature node
}
The code in CODEBLOCK 1 and CODEBLOCK 2 is functionally identical. This adds size and complexity to the code base, and makes things like maintenance and debugging more difficult. This should be fixed to remove redundancy and come up with a single approach to handle both cases.
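One possible single approach is to first collect the nodes to process (the children of a pseudo node, or the node itself) and then run one copy of the processing code over that list. The sketch below uses a hypothetical, self-contained Node type in place of GtFeatureNode so that it compiles on its own; collect_nodes, process_node, and visit_feature_node are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a feature node. */
typedef struct Node Node;
struct Node
{
  int is_pseudo;        /* nonzero if this node only groups its children */
  Node **children;
  size_t num_children;
  int processed;
};

#define MAX_NODES 32

/* Fills out with the nodes that need processing: the children of a
 * pseudo node, or the node itself otherwise. Returns how many. */
static size_t collect_nodes(Node *fn, Node **out)
{
  if (!fn->is_pseudo)
  {
    out[0] = fn;
    return 1;
  }
  assert(fn->num_children <= MAX_NODES);
  for (size_t i = 0; i < fn->num_children; i++)
    out[i] = fn->children[i];
  return fn->num_children;
}

/* The single copy of the logic formerly duplicated in both code blocks. */
static void process_node(Node *node) { node->processed = 1; }

/* Handles pseudo and normal nodes through the same code path. */
size_t visit_feature_node(Node *fn)
{
  Node *nodes[MAX_NODES];
  size_t n = collect_nodes(fn, nodes);
  for (size_t i = 0; i < n; i++)
    process_node(nodes[i]);
  return n;
}
```

The same effect could also be had by moving the duplicated code into one helper function and calling it from both branches; the collect-then-process form just makes the single code path explicit.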
Currently, the ParsEval HTML output has the AEGeAn version number printed at the bottom of each page. This is the desired behavior for stable releases, but for someone who checks out the code with Git, it would be preferable to print the commit SHA1 instead.
So when compiling the code, the Makefile should check whether there is a git repository. If so, it should grab the SHA1 of the latest commit, and then somehow define/replace the existing one when compiling.
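On the C side, this could be handled with a preprocessor macro that the build defines only when a Git checkout is detected. In this sketch the macro names AGN_COMMIT_SHA and AGN_RELEASE_VERSION and the function format_version_footer are assumptions; the Makefile would be expected to pass something like -DAGN_COMMIT_SHA='"<sha1>"' using the output of git rev-parse HEAD.

```c
#include <assert.h>
#include <stdio.h>

/* AGN_RELEASE_VERSION and AGN_COMMIT_SHA are hypothetical macro names.
 * The fallback below applies when the build does not define a version. */
#ifndef AGN_RELEASE_VERSION
#define AGN_RELEASE_VERSION "unknown"
#endif

/* Writes the text for the report footer into buf. Uses the commit SHA1 if
 * the build defined AGN_COMMIT_SHA, and the release version otherwise.
 * Returns the value returned by snprintf. */
int format_version_footer(char *buf, size_t size)
{
  assert(buf != NULL && size > 0);
#ifdef AGN_COMMIT_SHA
  return snprintf(buf, size, "AEGeAn commit %s", AGN_COMMIT_SHA);
#else
  return snprintf(buf, size, "AEGeAn version %s", AGN_RELEASE_VERSION);
#endif
}
```

Stable release tarballs, which have no Git repository, would compile without AGN_COMMIT_SHA and fall back to the version number automatically.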
The public API needs documentation before I can really claim the ability to link against the AEGeAn library. The good: the documentation is already there. The bad: I probably need to write my own script to process it and create something easily readable.
When the ParsEval docs were initially written, Pango was not a prerequisite for GenomeTools installation. Now that it is, I should add it to the install documentation.
The agn_parse_loci command is incorrect in that it treats each unique prediction gene as its own locus, even though it may overlap with other prediction genes. This should be a quick fix. It also needs to be applied to the AgnLocusIndex class.
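The intended behavior, grouping overlapping genes into one locus, amounts to merging overlapping intervals. A minimal sketch follows, assuming all genes are on the same sequence and use closed (inclusive) coordinates; the Range type and merge_into_loci are hypothetical names, not the AEGeAn implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical closed interval on a single sequence. */
typedef struct { long start; long end; } Range;

/* Orders ranges by start coordinate, then by end coordinate. */
static int range_cmp(const void *a, const void *b)
{
  const Range *ra = a, *rb = b;
  if (ra->start != rb->start)
    return ra->start < rb->start ? -1 : 1;
  return (ra->end > rb->end) - (ra->end < rb->end);
}

/* Sorts the gene ranges in place, then overwrites the front of the array
 * with merged loci. Returns the number of loci. */
size_t merge_into_loci(Range *genes, size_t n)
{
  assert(genes != NULL || n == 0);
  if (n == 0)
    return 0;

  qsort(genes, n, sizeof *genes, range_cmp);
  size_t last = 0; /* index of the locus currently being extended */
  for (size_t i = 1; i < n; i++)
  {
    if (genes[i].start <= genes[last].end)
    {
      /* overlaps the current locus: extend it if necessary */
      if (genes[i].end > genes[last].end)
        genes[last].end = genes[i].end;
    }
    else
    {
      /* no overlap: start a new locus */
      genes[++last] = genes[i];
    }
  }
  return last + 1;
}
```

Note that genes are merged transitively: two genes that do not overlap each other end up in the same locus if a third gene overlaps both.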
GeneAnnoLogy needs a better mechanism to prevent errors caused by ID collisions.
The AgnLocusIndex methods for locus parsing include hooks for setting the number of OpenMP threads to use during locus parsing, but these are not used. I need to decide the best way to handle this.
I have done some minimal testing for the new interval locus parsing logic, but this needs more attention before it can be classified as stable.
Show the GFF3 data for the specified annotations
Usage: geneannology show [options] repo
  Options:
    -c|--commit: STRING    show data as they were as of the specified snapshot
                           (as identified by the corresponding SHA1 hash);
                           default is current data
    -h|--help              print this help message and exit
    -r|--range: INT,INT    show data only for the specified range; must also
                           specify a sequence ID
    -s|--seqid: STRING     show data only for the specified sequence
Both the canon-gff3 program and the locuspocus program depend on code that is currently organized in ParsEval classes/modules. This needs to be refactored.
ParsEval is not supposed to enumerate and compare all clique pairs for loci whose complex transcript structures yield an inordinate number of clique pairs. That's the idea at least. In some cases, however, ParsEval still hangs on a locus for a long time, only to report that there are too many clique pairs to analyze and move on.
The idea behind this check is that potential time sinks would be avoided, so the fact that ParsEval is still hanging on some loci is unfortunate. I need to check whether this is a problem with my implementation, or whether enumerating all the clique pairs is prerequisite to this check.
Data files used by a colleague that caused the hang are at http://gremlin2.soic.indiana.edu/tmp/parseval-sowmya/, specifically the locus Group6.36[339737, 491606].
ParsEval HTML output is organized by sequence and sorted by location. There is no easy way to find all the loci containing perfect matches, or CDS matches, etc. I implemented a script to get this from CSV output (https://gist.github.com/standage/5414796), but it would be very convenient to have this information readily in the HTML output.
For transcripts whose UTRs contain introns, it seems that exons are not inferred properly.