brendelgroup / aegean

Integrated toolkit for analysis and evaluation of annotated genomes

Home Page: http://brendelgroup.github.io/AEGeAn

License: ISC License

CSS 0.69% Python 37.21% Perl 0.61% Shell 1.29% C 59.51% Makefile 0.68%

aegean's People

Contributors

jfdenton standage vpbrendel altingia wbyu satta gerbenvoshol wangdi2014 qussai96 martin-g

aegean's Issues

CDS inference issue

If no CDS is detected, the gene validator immediately tries to infer one. However, inference depends on exons and codons, which may be absent if the gene is not protein-coding. The validator should check that these are present before trying to infer the CDS.
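A minimal sketch of such a check. It assumes that "codons" means annotated start and stop codon features. The `TranscriptSummary` type and `can_infer_cds` function are hypothetical names used only for illustration, not part of AEGeAn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the features found under one mRNA (not an AEGeAn type). */
typedef struct
{
  size_t num_cds;         /* explicitly annotated CDS segments */
  size_t num_exons;       /* annotated exons */
  bool   has_start_codon; /* a start codon feature is present (assumption) */
  bool   has_stop_codon;  /* a stop codon feature is present (assumption) */
} TranscriptSummary;

/* Return true only if CDS inference is both needed and possible. */
bool can_infer_cds(const TranscriptSummary *ts)
{
  assert(ts != NULL);
  if(ts->num_cds > 0)
    return false; /* CDS is already annotated, nothing to infer */
  return ts->num_exons > 0 && ts->has_start_codon && ts->has_stop_codon;
}
```

With a guard like this, a non-coding gene that has exons but no codons would skip inference instead of failing inside it.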

Refactor comparative analysis code in AgnCliquePair class

It's long and messy: it could benefit from some work.

Refactor AgnGeneValidator class

The entire AgnGeneValidator class, and especially the mRNA validation function, is bulky and needs refactoring for better code organization.

Comparative analysis in `agn_calc_edit_distance` function

The agn_calc_edit_distance function is called by the function(s) that calculate splice complexity. The agn_calc_edit_distance function ends up doing a full comparative analysis, while only the edit distance is needed. Would it provide any substantial benefit to calculate edit distance between two transcripts independent of a full comparative assessment?

Method refactoring for AgnGeneLocus

A recent commit refactored the agn_gene_locus_get_clique_pairs and agn_gene_locus_find_best_pairs methods (now agn_gene_locus_enumerate_clique_pairs and agn_gene_locus_comparative_evaluation, respectively). However, for semantics' sake it may still be worth separating the methods that retrieve the return values (number of clique pairs and the best pairs for reporting) from the methods that calculate and store those values. Perhaps those should be private?

Expose AEGeAn core functionality with a library archive

Check out man ar
add core AEGeAn classes and modules to libaegean.a library
should be accessible by arbitrary C code with -laegean flag
will need a script to create documentation from annotated header files

Address issues on Bitbucket

Bitbucket was originally used for hosting AEGeAn, and some issues are still being tracked there. These need to be addressed and/or ported over to GitHub.

Collect all errors rather than terminate after a single one

There are several places in the core code where a loop is terminated prematurely if errors are detected in the logger. On one hand, the sooner the program terminates after detection of an error, the more likely it is that the error message(s) will be useful for addressing problems with the data. On the other hand, if the loop is allowed to continue and finish parsing all data, then all problems can be reported at once (so they can be fixed all at once rather than one at a time).

I should consider whether to allow these loops to run to completion even after errors have been detected. Preliminary testing suggests this shouldn't cause many problems.
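The run-to-completion variant might look roughly like the following sketch. The `Range` type, the validity rule (1-based start no greater than end), and `find_all_invalid` are assumptions made for illustration and do not come from the AEGeAn code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type: a 1-based, closed coordinate range. */
typedef struct
{
  long start;
  long end;
} Range;

/*
 * Check every range and write the indices of invalid ones into bad[], which
 * has room for badcap entries. Returns the total number of invalid ranges
 * found, which may exceed badcap.
 */
size_t find_all_invalid(const Range *ranges, size_t n, size_t *bad,
                        size_t badcap)
{
  assert(ranges != NULL || n == 0);
  size_t numbad = 0;
  for(size_t i = 0; i < n; i++)
  {
    int invalid = ranges[i].start < 1 || ranges[i].start > ranges[i].end;
    if(!invalid)
      continue;
    if(numbad < badcap)
      bad[numbad] = i;
    numbad++;
    /* deliberately no 'break': keep scanning so that every problem can be
       reported in a single pass */
  }
  return numbad;
}
```

The caller decides after the loop whether the accumulated errors are fatal, which keeps the early-termination policy out of the parsing code.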

Feature node: rename source?

In an effort to completely decouple AEGeAn from any non-API GenomeTools code, the code to reset a feature node's source attribute has been removed. I need to work with the GenomeTools developers to discuss potential solutions.

Comparative evaluation code, data structures

Some functions and data structures were previously designated as "ParsEval-specific" and the code was refactored accordingly so that they did not clutter the core code and API. Now that I am writing the geneannology diff command, however, I see how much of this really should be core code. This deserves another look-over to see how much ParsEval-specific code there really should be.

Javascript frameworks

This project is implemented entirely in C, although it is listed as a JavaScript project because the bundled jQuery and Mootools source code outweighs my own C code. I need to rename the JavaScript files so as to be recognized as frameworks (see https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml).

Grep for 'stderr' reveals many inappropriate fprintf calls...

...these need to be fixed now that the AgnLogger class is available.

AgnError vs AgnLogger

I'm currently using the AgnError class to store warning messages and/or non-fatal error messages. This distinction is silly, but I like the functionality this class provides. Perhaps I should revamp the class altogether, rename it AgnLogger, and enable it to keep track of error messages (always fatal), warning messages (never fatal), and status messages.
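As a rough illustration of the proposed design, here is a small fixed-capacity logger with three message levels. All names (`Logger`, `logger_log`, `logger_has_error`, `logger_print`) are hypothetical; the interface of the real AgnLogger class may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { LOG_STATUS, LOG_WARNING, LOG_ERROR } LogLevel;

#define LOG_CAPACITY 32
#define LOG_MSG_LEN  128

typedef struct
{
  LogLevel level;
  char     text[LOG_MSG_LEN];
} LogMessage;

/* Hypothetical logger, not the AgnLogger API. */
typedef struct
{
  LogMessage messages[LOG_CAPACITY];
  size_t     count;
} Logger;

void logger_init(Logger *log)
{
  assert(log != NULL);
  log->count = 0;
}

/* Store a message; returns false if the logger is full. */
bool logger_log(Logger *log, LogLevel level, const char *text)
{
  assert(log != NULL && text != NULL);
  if(log->count >= LOG_CAPACITY)
    return false;
  log->messages[log->count].level = level;
  snprintf(log->messages[log->count].text, LOG_MSG_LEN, "%s", text);
  log->count++;
  return true;
}

/* Errors are always fatal, so callers can test this after each step. */
bool logger_has_error(const Logger *log)
{
  assert(log != NULL);
  for(size_t i = 0; i < log->count; i++)
    if(log->messages[i].level == LOG_ERROR)
      return true;
  return false;
}

/* Print all messages at or above minlevel; returns how many were printed. */
size_t logger_print(const Logger *log, LogLevel minlevel, FILE *out)
{
  static const char *labels[] = { "status", "warning", "error" };
  assert(log != NULL && out != NULL);
  size_t printed = 0;
  for(size_t i = 0; i < log->count; i++)
  {
    if(log->messages[i].level < minlevel)
      continue;
    fprintf(out, "[%s] %s\n", labels[log->messages[i].level],
            log->messages[i].text);
    printed++;
  }
  return printed;
}
```

Keeping the level with each message lets one object serve all three purposes, and the fatal/non-fatal distinction becomes a query rather than a separate class.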

AgnLocus print function...

...was designed to print GFF3, but it needs to be reviewed.

AgnLocusIndex sequence ranges

The AgnLocusIndex object should keep track of ranges for sequences that it parses.

New interface design proposal for GeneAnnoLogy

union: add new annotations, keep all

intersect: add new annotations
  -r|--replace    delete existing annotations that overlap with new annotations
  -c|--complement   do not add new annotations that overlap with existing annotations

clean: start all over again

update: modify existing annotations
  -n|--new: STRING    add transcript as new isoform, provide ID
  -r|--replace: STRING    provide ID of transcript to replace
  -c|--clean    start fresh at locus

delete: delete

Verify that infer exon function works with discontinuous UTRs

The issue title is pretty self-explanatory. From the code, it looks like it will not infer UTR-only exons.

ParsEval: report all loci

Currently, ParsEval looks for shared sequences and reports only loci on shared sequences. Instead, it should report all sequences and all loci.

Interval loci issues

terminal loci are not handled correctly
there are still some off-by-one errors that need to be addressed

Numbers wrong in sequence locus table (HTML output)

Since changing to the AgnLocusIndex code in ParsEval, the table of loci on the sequence page displays incorrect numbers for the number of reference genes, prediction genes, and comparisons per locus.

Filter function for AgnLocusIndex

I should consider implementing a function in the AgnLocusIndex class that would remove loci based on a provided set of filters. This would be much easier than inserting filter checks into subsequent analysis.
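One possible shape for such a function, sketched over a plain array rather than the real index. The `Locus` struct, the `LocusFilter` callback type, and the two example filters are hypothetical and only illustrate the idea of applying a provided set of filters in one place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified locus record (not the AgnLocus type). */
typedef struct
{
  long   start;
  long   end;
  size_t num_genes;
} Locus;

/* A filter returns true if the locus should be kept. */
typedef bool (*LocusFilter)(const Locus *locus, const void *param);

bool filter_min_length(const Locus *locus, const void *param)
{
  long minlen = *(const long *)param;
  return locus->end - locus->start + 1 >= minlen;
}

bool filter_max_genes(const Locus *locus, const void *param)
{
  size_t maxgenes = *(const size_t *)param;
  return locus->num_genes <= maxgenes;
}

/*
 * Remove, in place, every locus rejected by at least one filter. The order of
 * the remaining loci is preserved. Returns the number of loci kept.
 */
size_t loci_apply_filters(Locus *loci, size_t n, const LocusFilter *filters,
                          const void **params, size_t numfilters)
{
  assert(loci != NULL || n == 0);
  size_t kept = 0;
  for(size_t i = 0; i < n; i++)
  {
    bool keep = true;
    for(size_t j = 0; j < numfilters && keep; j++)
      keep = filters[j](&loci[i], params[j]);
    if(keep)
      loci[kept++] = loci[i];
  }
  return kept;
}
```

Downstream analysis code then sees only loci that passed, so it needs no filter checks of its own.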

Refactoring of code in parseval.c

If AEGeAn, and ParsEval in particular, is going to be made available as a shared library, then the code needs better organization. Ideally, the parseval.c program would be pretty minimal, making calls to a small number of functions that another developer could use. These functions should utilize the AgnError class rather than the GtError class, although of course they may need to use the GtError class internally.

Reorganize 'agn_clique_pair_add_to_vector' function

The agn_clique_pair_add_to_vector function was originally included in the AgnCliquePair class, IIRC, since the locus range is needed to add the transcript's features to the model vector. However, this could (and probably should) be generalized and moved to a more general module/class.

Move PNG-related code to PeReports module

There is some Cairo-related graphics code in the AgnPairwiseCompareLocus class. I should consider moving this to the PeReports module, as graphics for other AEGeAn tools aren't on the roadmap any time in the near future.

Refactor `agn_pairwise_compare_locus_get_clique_pairs` function...

...it seems like an easy fix.

Code documentation

Documentation is missing for several methods/functions. Many of these have been marked with FIXME, but some may not have been.

Memory leak check

AEGeAn has not been analyzed for memory leaks in quite a while, and I suspect there are several. It is time.

Code to parse loci

The code for parsing loci should be accessible by different API functions, with definitions something like these.

/**
 * Given a pair of annotation feature sets in memory, identify loci while
 * keeping the two sources of annotation separate (to enable comparison).
 *
 * @param[in]  refrfeats    reference annotations
 * @param[in]  predfeats    prediction annotations
 * @param[out] error        error object to which warning/error messages will be
 *                          written if necessary
 * @returns                 a GtArray containing pointers to GtArrays A_1, A_2,
 *                          ..., A_n (if the reference and prediction share
 *                          n annotated sequences), each A_i containing pointers
 *                          to AgnPairwiseCompareLocus objects
 */
GtArray *agn_parse_pairwise_loci_memory(GtFeatureIndex *refrfeats,
                                        GtFeatureIndex *predfeats,
                                        AgnError *error);

/**
 * Given a pair of annotation feature sets in GFF3 files, identify loci while
 * keeping the two sources of annotation separate (to enable comparison).
 *
 * @param[in]  refrfile     path to GFF3 file containing reference annotations
 * @param[in]  predfile     path to GFF3 file containing prediction annotations
 * @param[out] error        error object to which warning/error messages will be
 *                          written if necessary
 * @returns                 a GtArray containing pointers to GtArrays A_1, A_2,
 *                          ..., A_n (if the reference and prediction share
 *                          n annotated sequences), each A_i containing pointers
 *                          to AgnPairwiseCompareLocus objects
 */
GtArray *agn_parse_pairwise_loci_disk(const char *refrfile,
                                      const char *predfile,
                                      AgnError *error);

/**
 * Identify loci given an index of annotation features.
 *
 * @param[in]  features    index containing valid annotation features (FIXME)
 * @param[out] error       error object to which warning/error messages will be
 *                         written if necessary
 * @returns                a GtArray containing pointers to GtArrays A_1, A_2,
 *                         ..., A_n (if the feature index contains n annotated
 *                         sequences), each A_i containing pointers to AgnLocus
 *                         objects
 */
GtArray *agn_parse_loci_memory(GtFeatureIndex *features, AgnError *error);

/**
 * Identify loci given a list of annotation files.
 *
 * @param[in]  numfiles     the number of annotation files
 * @param[in]  filenames    list of filenames corresponding to GFF3 files
 * @param[out] error        error object to which warning/error messages will be
 *                          written if necessary
 * @returns                 a GtArray containing pointers to GtArrays A_1, A_2,
 *                          ..., A_n (if the files together contain n annotated
 *                          sequences), each A_i containing pointers to AgnLocus
 *                          objects
 */
GtArray *agn_parse_loci_disk(int numfiles, const char **filenames,
                             AgnError *error);

Remove `locusgff3` option from ParsEval

This functionality is now trivial with LocusPocus.

Replace ga_get_loci_from_feature_index with AgnLocusIndex

It should be fairly simple to replace the ga_get_loci_from_feature_index function with the functionality of the AgnLocusIndex class: this is precisely what the class was designed for.

Concern about AgnLocusIndex delete

The AgnLocusIndex class maintains a hashmap of interval trees, each of which holds gene locus objects. For memory management, it would be much easier to add agn_gene_locus_delete as the interval tree's free function. However, the consequence is that all gene loci would remain in memory until the locus index is freed. Currently, ParsEval deletes loci as it analyzes them to reduce the memory footprint. I'll have to brainstorm a good solution to this.

The 'agn_gene_locus_*_gene_ids' functions need inspection

In the HTML output mode, these seem to be having an issue. Grabbing the genes directly and printing out their IDs seems to work fine though.

Intron-containing UTR causes graphics trouble

I was using ParsEval to compare some new Maker annotations to some older annotations, and I found a graphics issue that I think is related to intron-containing UTRs. Most UTRs reported by Maker have no ID attribute, but those that contain introns require multiple lines in the GFF3 file, and therefore Maker uses the ID attribute to indicate which entries/lines correspond to the same UTR.

I'm guessing this is the cause of some strange behavior I'm seeing with the GenomeTools/AnnotationSketch-generated graphics. Here is a ParsEval report whose input included intron-containing UTRs as reported by Maker...

...and here is the same report for the same input, minus the UTR features (so that they were inferred by ParsEval).

I need to come up with a few more use cases to confirm that this is the cause of the behavior, then bring this up with the GenomeTools developers and determine whether it's a bug or a feature.

Copy filter file without install

ParsEval has an -a option for copying shared data from a non-standard location; this needs to be extended to filter files.

Write unit tests

Several functional tests have been integrated, but AEGeAn needs some unit tests. All of this could eventually be connected to Travis CI for automated testing.

Add filter to GFF3 out function, CanonGFF3 program

It might be nice to implement a filter for the function(s) that write GFF3 data to output, specifying feature types to be skipped. This could then easily be integrated with the CanonGFF3 program.

The 'agn_parse_canonical' function too stringent for parsing simple loci

The AgnLocusIndex class uses the agn_parse_canonical function, which does gene validation, to parse loci. This is (perhaps?) appropriate for parsing loci intended for pairwise comparison, but the simple locus parsing methods were written so as to handle any data types correctly, so validation is not required. It might be worth looking into whether this can be improved.

Reduce redundancy in '*_visit_feature_node' functions

In several places, both in stable code (pushed onto GitHub) and unstable code (locally), there are feature node visitor functions that have something of the following form.

  if(gt_feature_node_is_pseudo(fn))
  {
    GtArray *features = gt_array_new( sizeof(GtFeatureNode *) );
    agn_gt_feature_node_resolve_pseudo_node(fn, features);
    while(gt_array_size(features) > 0)
    {
      // CODEBLOCK 1: process child of pseudo node
    }
    gt_array_delete(features);
  }
  else
  {
    // CODEBLOCK 2: process normal feature node
  }

The code in CODEBLOCK 1 and CODEBLOCK 2 is functionally identical. This adds size and complexity to the code base, and makes things like maintenance and debugging more difficult. This should be fixed to remove redundancy and come up with a single approach to handle both cases.
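One way to collapse the two branches is to route both cases through a single callback, so the processing code exists only once. The sketch below uses a hypothetical `FeatureNode` struct in place of the GenomeTools type and handles only the direct children of a pseudo node. `for_each_real_feature` and `sum_ids` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a feature node (not the GenomeTools type). */
typedef struct FeatureNode
{
  bool                 is_pseudo;
  struct FeatureNode **children;
  size_t               num_children;
  int                  id;
} FeatureNode;

typedef void (*NodeHandler)(FeatureNode *node, void *data);

/*
 * Call handler once for every real feature: on each direct child of a pseudo
 * node, or on the node itself if it is a normal feature node.
 */
void for_each_real_feature(FeatureNode *fn, NodeHandler handler, void *data)
{
  assert(fn != NULL && handler != NULL);
  if(!fn->is_pseudo)
  {
    handler(fn, data);
    return;
  }
  for(size_t i = 0; i < fn->num_children; i++)
    handler(fn->children[i], data);
}

/* Example handler: what used to be CODEBLOCK 1 and CODEBLOCK 2 would live
   here, written once. */
void sum_ids(FeatureNode *node, void *data)
{
  int *total = data;
  *total += node->id;
}
```

Each visitor would then supply only its handler, and the pseudo-node logic would live in one helper.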

Git version number

Currently, the ParsEval HTML output has the AEGeAn version number printed at the bottom of each page. This is the desired behavior for stable releases, but for someone who checks out the code with Git, it would be preferable to print the commit SHA1 instead.

So when compiling the code, the Makefile should check whether there is a git repository. If so, it should grab the SHA1 of the latest commit, and then somehow define/replace the existing one when compiling.
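On the C side, this could be handled with a preprocessor fallback, as in the sketch below. The macro names `AGN_GIT_SHA1` and `AGN_RELEASE_VERSION`, the placeholder version string, and the function `agn_version_string` are assumptions for illustration. The Makefile fragment in the comment relies on `git rev-parse HEAD`, which prints the SHA1 of the current commit.

```c
#include <assert.h>

/* Release version; "x.y.z" is a placeholder, not a real AEGeAn version. */
#define AGN_RELEASE_VERSION "x.y.z"

/*
 * AGN_GIT_SHA1 is a hypothetical macro. A Makefile could define it only when
 * building inside a Git working copy, for example:
 *
 *   GIT_SHA1 := $(shell git rev-parse HEAD 2>/dev/null)
 *   ifneq ($(GIT_SHA1),)
 *   CFLAGS += -DAGN_GIT_SHA1='"$(GIT_SHA1)"'
 *   endif
 */
const char *agn_version_string(void)
{
#ifdef AGN_GIT_SHA1
  const char *version = AGN_GIT_SHA1;        /* built from a Git checkout */
#else
  const char *version = AGN_RELEASE_VERSION; /* built from a release archive */
#endif
  assert(version[0] != '\0'); /* an empty definition would be a build error */
  return version;
}
```

The HTML report code would call the function instead of using a hard-coded string, so release builds and Git builds need no other changes.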

Document public API

The public API needs documentation before I can really claim the ability to link against the AEGeAn library. The good: the documentation is already there. The bad: I probably need to write my own script to process it and create something easily readable.

Add Pango to install instructions

When the ParsEval docs were initially written, Pango was not a prerequisite for GenomeTools installation. Now that it is, I should add it to the install documentation.

Pairwise comparison locus parsing incorrect

The agn_parse_loci command is incorrect: it treats each unique prediction gene as its own locus, even though a prediction gene may overlap with other prediction genes. This should be a quick fix. The same fix also needs to be applied to the AgnLocusIndex class.
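The intended behavior amounts to merging overlapping gene intervals. The sketch below works on a single sequence, ignores strand and annotation source, and assumes 1-based closed coordinates. `Interval` and `merge_into_loci` are hypothetical names, not AEGeAn functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical gene or locus interval with 1-based, closed coordinates. */
typedef struct
{
  long start;
  long end;
} Interval;

/* Order intervals by start coordinate, then by end coordinate. */
static int interval_cmp(const void *a, const void *b)
{
  const Interval *x = a;
  const Interval *y = b;
  if(x->start != y->start)
    return x->start < y->start ? -1 : 1;
  if(x->end != y->end)
    return x->end < y->end ? -1 : 1;
  return 0;
}

/*
 * Sort the gene intervals and merge every run of overlapping genes into one
 * locus, in place. On return, genes[0] through genes[numloci - 1] hold the
 * loci. Returns numloci.
 */
size_t merge_into_loci(Interval *genes, size_t n)
{
  assert(genes != NULL || n == 0);
  if(n == 0)
    return 0;
  qsort(genes, n, sizeof(Interval), interval_cmp);
  size_t numloci = 1;
  for(size_t i = 1; i < n; i++)
  {
    Interval *locus = &genes[numloci - 1];
    if(genes[i].start <= locus->end)
    {
      /* overlaps the current locus: extend it if necessary */
      if(genes[i].end > locus->end)
        locus->end = genes[i].end;
    }
    else
    {
      /* no overlap: this gene starts a new locus */
      genes[numloci++] = genes[i];
    }
  }
  return numloci;
}
```

For example, genes at 100-300 and 250-400 end up in a single locus 100-400 rather than two.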

ID collisions

GeneAnnoLogy needs a better mechanism to prevent errors caused by ID collisions.

OpenMP threads for locus parsing

The AgnLocusIndex methods for locus parsing include hooks for setting the number of OpenMP threads to use during locus parsing, but these are not used. I need to decide the best way to handle this.

Testing for interval locus parsing

I have done some minimal testing for the new interval locus parsing logic, but this needs more attention before it can be classified as stable.

GeneAnnoLogy cat -> show

Show the GFF3 data for the specified annotations
Usage: geneannology show [options] repo
  Options:
    -c|--commit: STRING    show data as they were as of the specified snapshot
                           (as identified by the corresponding SHA1 hash);
                           default is current data
    -h|--help              print this help message and exit
    -r|--range: INT,INT    show data only for the specified range; must also
                           specify a sequence ID
    -s|--seqid: STRING     show data only for the specified sequence

canon-gff3 and locuspocus dependencies

Both the canon-gff3 program and the locuspocus program depend on code that is currently organized in ParsEval classes/modules. This needs to be refactored.

ParsEval: transcript clique pair enumeration

ParsEval will not try to enumerate and compare all of the clique pairs for loci with complex transcript structures yielding an inordinate number of clique pairs. That's the idea at least. It seems in some cases that ParsEval will still hang on a locus for a long time, only to report that there are too many clique pairs to analyze and move on.

The idea behind this check is that potential time sinks would be avoided, so the fact that ParsEval is still hanging on some loci is unfortunate. I need to check whether this is a problem with my implementation, or whether enumerating all the clique pairs is prerequisite to this check.
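If enumeration turns out to be the bottleneck, one option is to apply the cap during enumeration rather than afterwards. The sketch below assumes that a transcript clique is a maximal set of mutually non-overlapping transcripts. It counts such sets with a bitmask Bron-Kerbosch recursion and gives up as soon as the count exceeds a limit. It is not ParsEval's implementation; `Transcript`, `count_cliques`, and `count_cliques_capped` are hypothetical, and at most 64 transcripts are supported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transcript interval with 1-based, closed coordinates. */
typedef struct
{
  long start;
  long end;
} Transcript;

/* One Bron-Kerbosch step over bitmask sets; stops once *count > limit. */
static void count_cliques(uint64_t candidates, uint64_t excluded,
                          const uint64_t *compatible, size_t limit,
                          size_t *count)
{
  if(candidates == 0)
  {
    if(excluded == 0)
      (*count)++; /* nothing can extend the current set, so it is maximal */
    return;
  }
  for(unsigned v = 0; v < 64 && candidates != 0; v++)
  {
    uint64_t bit = UINT64_C(1) << v;
    if(!(candidates & bit))
      continue;
    if(*count > limit)
      return; /* cap exceeded: abandon the rest of the search */
    count_cliques(candidates & compatible[v], excluded & compatible[v],
                  compatible, limit, count);
    candidates &= ~bit;
    excluded |= bit;
  }
}

/*
 * Count maximal sets of mutually non-overlapping transcripts. Returns the
 * exact count if it is at most limit, and limit + 1 otherwise.
 */
size_t count_cliques_capped(const Transcript *t, size_t n, size_t limit)
{
  assert(n <= 64);
  assert(t != NULL || n == 0);
  if(n == 0)
    return 0;
  uint64_t compatible[64] = { 0 };
  for(size_t i = 0; i < n; i++)
    for(size_t j = 0; j < n; j++)
      if(i != j && (t[i].end < t[j].start || t[j].end < t[i].start))
        compatible[i] |= UINT64_C(1) << j;
  uint64_t all = n == 64 ? UINT64_MAX : (UINT64_C(1) << n) - 1;
  size_t count = 0;
  count_cliques(all, 0, compatible, limit, &count);
  return count;
}
```

A return value greater than the limit means the locus can be skipped without finishing the enumeration, which is the behavior the check was meant to provide.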

Data files used by a colleague that caused the hang are at http://gremlin2.soic.indiana.edu/tmp/parseval-sowmya/, specifically the locus Group6.36[339737, 491606].

Access ParsEval comparisons by category

ParsEval HTML output is organized by sequence and sorted by location. There is no easy way to find all the loci containing perfect matches, or CDS matches, etc. I implemented a script to get this from CSV output (https://gist.github.com/standage/5414796), but it would be very convenient to have this information readily in the HTML output.

Exons not inferred properly with intron-containing UTRs

For transcripts whose UTRs contain introns, it seems that exons are not inferred properly.