aegean's People

Contributors

jfdenton, martin-g, satta, standage, timlai4, vpbrendel

aegean's Issues

CDS inference issue

If no CDS is detected, the gene validator immediately attempts to infer one. However, inference depends on exons and codons, which may be absent if the gene is not protein-coding. The validator should check for these features before trying to infer the CDS.
[Screenshot: 2013-06-07, 4:14:27 pm]
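The proposed check could look roughly like the sketch below. It is only an illustration: the `Feature` struct, the `can_infer_cds` function, and the feature type strings (`exon`, `start_codon`, `stop_codon`, `CDS`) are assumptions made for the example, not part of the AgnGeneValidator code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one child feature of a gene. */
typedef struct {
  const char *type;   /* assumed type names: "exon", "start_codon", ... */
} Feature;

/* Count the features of the given type. */
static size_t count_type(const Feature *feats, size_t n, const char *type)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (strcmp(feats[i].type, type) == 0)
      count++;
  return count;
}

/* CDS inference is attempted only when no CDS is present AND the exons and
 * codons it depends on are all there. */
bool can_infer_cds(const Feature *feats, size_t n)
{
  assert(feats != NULL || n == 0);
  if (count_type(feats, n, "CDS") > 0)
    return false;   /* explicit CDS given, nothing to infer */
  return count_type(feats, n, "exon") > 0 &&
         count_type(feats, n, "start_codon") > 0 &&
         count_type(feats, n, "stop_codon") > 0;
}
```

With a guard like this, a non-coding gene that has exons but no codons would simply skip CDS inference instead of triggering a validation failure.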

Refactor AgnGeneValidator class

The entire AgnGeneValidator class, and especially the mRNA validation function, is bulky and needs refactoring for better code organization.

Comparative analysis in `agn_calc_edit_distance` function

The agn_calc_edit_distance function is called by the function(s) that calculate splice complexity. The agn_calc_edit_distance function ends up doing a full comparative analysis, while only the edit distance is needed. Would it provide any substantial benefit to calculate edit distance between two transcripts independent of a full comparative assessment?

Method refactoring for AgnGeneLocus

A recent commit refactored the agn_gene_locus_get_clique_pairs and agn_gene_locus_find_best_pairs methods (now agn_gene_locus_enumerate_clique_pairs and agn_gene_locus_comparative_evaluation, respectively). However, for semantics' sake it may still be worth separating the methods that retrieve the return values (number of clique pairs and the best pairs for reporting) from the methods that calculate and store those values. Perhaps those should be private?

Address issues on Bitbucket

Bitbucket was originally used for hosting AEGeAn, and some issues are still being tracked there. These need to be addressed and/or ported over to GitHub.

Collect all errors rather than terminate after a single one

There are several places in the core code where a loop is terminated prematurely if errors are detected in the logger. On one hand, the sooner the program terminates after detection of an error, the more likely it is that the error message(s) will be useful for addressing problems with the data. On the other hand, if the loop is allowed to continue and finish parsing all data, then all problems can be reported at once (so they can be fixed all at once rather than one at a time).

I should consider whether to allow these loops to run to completion even after errors have been detected. Preliminary testing suggests this shouldn't cause many problems.
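The two behaviors can be compared with a small sketch. Everything here is hypothetical: `Record`, `ErrorLog`, `validate_records`, and the "start <= end" validity rule are stand-ins chosen for the example, not AEGeAn code. A `fail_fast` flag selects between stopping at the first error and running the loop to completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ERRORS 64

/* Hypothetical minimal log: remembers the indices of the bad records. */
typedef struct {
  size_t bad_records[MAX_ERRORS];
  size_t count;
} ErrorLog;

/* Hypothetical record; the validity rule below is only a placeholder. */
typedef struct { long start, end; } Record;

static bool record_is_valid(const Record *r) { return r->start <= r->end; }

/* Validate all records. With fail_fast the loop stops at the first error;
 * otherwise every problem is collected. Returns the number of records
 * examined. */
size_t validate_records(const Record *recs, size_t n, bool fail_fast,
                        ErrorLog *log)
{
  assert(log != NULL);
  for (size_t i = 0; i < n; i++) {
    if (!record_is_valid(&recs[i])) {
      if (log->count < MAX_ERRORS)
        log->bad_records[log->count++] = i;
      if (fail_fast)
        return i + 1;
    }
  }
  return n;
}
```

Keeping the choice behind a flag would make it possible to test both modes on the same data before settling on a default.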

Feature node: rename source?

In an effort to completely decouple AEGeAn from any non-API GenomeTools code, the code to reset a feature node's source attribute has been removed. I need to work with the GenomeTools developers to discuss potential solutions.

Comparative evaluation code, data structures

Some functions and data structures were previously designated as "ParsEval-specific" and the code was refactored accordingly so that they did not clutter the core code and API. Now that I am writing the geneannology diff command, however, I see how much of this really should be core code. This deserves another look-over to see how much ParsEval-specific code there really should be.

AgnError vs AgnLogger

I'm currently using the AgnError class to store warning messages and/or non-fatal error messages. This distinction is silly, but I like the functionality this class provides. Perhaps I should revamp the class altogether, rename it AgnLogger, and enable it to keep track of error messages (always fatal), warning messages (never fatal), and status messages.
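A minimal sketch of what such a logger might store is shown below. The names (`Logger`, `logger_log`, `logger_has_error`) and the fixed-size buffers are assumptions for illustration only; the real class would presumably use GenomeTools containers and the `agn_` naming scheme.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Three message classes: status, warning (never fatal), error (fatal). */
typedef enum { LOG_STATUS, LOG_WARNING, LOG_ERROR } LogLevel;

#define LOG_MAX_MESSAGES 32
#define LOG_MAX_LENGTH   128

/* Hypothetical logger with fixed-size storage to keep the sketch short. */
typedef struct {
  LogLevel level[LOG_MAX_MESSAGES];
  char     text[LOG_MAX_MESSAGES][LOG_MAX_LENGTH];
  size_t   count;
} Logger;

void logger_init(Logger *log) { log->count = 0; }

/* Store a printf-style message; messages are dropped once the log is full. */
void logger_log(Logger *log, LogLevel level, const char *fmt, ...)
{
  assert(log != NULL && fmt != NULL);
  if (log->count == LOG_MAX_MESSAGES)
    return;
  va_list ap;
  va_start(ap, fmt);
  vsnprintf(log->text[log->count], LOG_MAX_LENGTH, fmt, ap);
  va_end(ap);
  log->level[log->count++] = level;
}

/* Number of stored messages of the given level. */
size_t logger_count(const Logger *log, LogLevel level)
{
  size_t n = 0;
  for (size_t i = 0; i < log->count; i++)
    if (log->level[i] == level)
      n++;
  return n;
}

/* Callers decide whether to abort by asking for fatal messages only. */
bool logger_has_error(const Logger *log)
{
  return logger_count(log, LOG_ERROR) > 0;
}
```

Separating "store a message" from "is anything fatal?" lets warnings and status messages accumulate without affecting control flow.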

New interface design proposal for GeneAnnoLogy

union: add new annotations, keep all

intersect: add new annotations
  -r|--replace    delete existing annotations that overlap with new annotations
  -c|--complement   do not add new annotations that overlap with existing annotations

clean: start all over again

update: modify existing annotations
  -n|--new: STRING    add transcript as new isoform, provide ID
  -r|--replace: STRING    provide ID of transcript to replace
  -c|--clean    start fresh at locus

delete: delete

ParsEval: report all loci

Currently, ParsEval looks for shared sequences and reports only loci on shared sequences. Instead, it should report all sequences and all loci.

Interval loci issues

  • terminal loci are not handled correctly
  • there are still some off-by-one errors that need to be addressed

Filter function for AgnLocusIndex

I should consider implementing a function in the AgnLocusIndex class that would remove loci based on a provided set of filters. This would be much easier than inserting filter checks into subsequent analysis.
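One possible shape for such a function is a list of predicate callbacks applied to each locus. The sketch below uses a plain array and a made-up `Locus` struct rather than the interval trees AgnLocusIndex actually holds, so every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified locus record. */
typedef struct { long start, end; size_t num_genes; } Locus;

/* A filter is a predicate plus optional user data; true means "keep". */
typedef bool (*LocusPredicate)(const Locus *locus, const void *data);
typedef struct { LocusPredicate keep; const void *data; } LocusFilter;

/* Remove (in place, preserving order) every locus rejected by at least one
 * filter. Returns the number of loci kept. Real code would also have to
 * free the removed loci. */
size_t locus_filter_apply(Locus *loci, size_t n,
                          const LocusFilter *filters, size_t nfilters)
{
  assert(loci != NULL || n == 0);
  size_t kept = 0;
  for (size_t i = 0; i < n; i++) {
    bool pass = true;
    for (size_t j = 0; j < nfilters && pass; j++)
      pass = filters[j].keep(&loci[i], filters[j].data);
    if (pass)
      loci[kept++] = loci[i];
  }
  return kept;
}

/* Example predicates: minimum length (closed coordinates) and non-empty. */
bool locus_min_length(const Locus *l, const void *data)
{
  long min = *(const long *)data;
  return l->end - l->start + 1 >= min;
}

bool locus_has_genes(const Locus *l, const void *data)
{
  (void)data;
  return l->num_genes > 0;
}
```

Filtering once at the index level would mean downstream analysis code never sees the rejected loci.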

Refactoring of code in parseval.c

If AEGeAn, and ParsEval in particular, is going to be made available as a shared library, then the code needs better organization. Ideally, the parseval.c program would be pretty minimal, making calls to a small number of functions that another developer could use. These functions should utilize the AgnError class rather than the GtError class, although of course they may need to use the GtError class internally.

Reorganize 'agn_clique_pair_add_to_vector' function

The agn_clique_pair_add_to_vector function was originally included in the AgnCliquePair class, IIRC, since the locus range is needed to add the transcript's features to the model vector. However, this could (and probably should) be generalized and moved to a more general module/class.

Move PNG-related code to PeReports module

There is some Cairo-related graphics code in the AgnPairwiseCompareLocus class. I should consider moving this to the PeReports module, as graphics for other AEGeAn tools aren't on the roadmap any time in the near future.

Code documentation

Documentation is missing for several methods/functions. Many of these have been marked with FIXME, but some may not have been.

Memory leak check

AEGeAn has not been analyzed for memory leaks in quite a while, and I suspect there are several. It is time.

Code to parse loci

The code for parsing loci should be accessible by different API functions, with definitions something like these.

/**
 * Given a pair of annotation feature sets in memory, identify loci while
 * keeping the two sources of annotation separate (to enable comparison).
 *
 * @param[in]  refrfeats    reference annotations
 * @param[in]  predfeats    prediction annotations
 * @param[out] error        error object to which warning/error messages will be
 *                          written if necessary
 * @returns                 a GtArray containing pointers to GtArrays A_1, A_2,
 *                          ..., A_n (if the reference and prediction share
 *                          n annotated sequences), each A_i containing pointers
 *                          to AgnPairwiseCompareLocus objects
 */
GtArray *agn_parse_pairwise_loci_memory(GtFeatureIndex *refrfeats,
                                        GtFeatureIndex *predfeats,
                                        AgnError *error);

/**
 * Given a pair of annotation feature sets in GFF3 files, identify loci while
 * keeping the two sources of annotation separate (to enable comparison).
 *
 * @param[in]  refrfile     path to GFF3 file containing reference annotations
 * @param[in]  predfile     path to GFF3 file containing prediction annotations
 * @param[out] error        error object to which warning/error messages will be
 *                          written if necessary
 * @returns                 a GtArray containing pointers to GtArrays A_1, A_2,
 *                          ..., A_n (if the reference and prediction share
 *                          n annotated sequences), each A_i containing pointers
 *                          to AgnPairwiseCompareLocus objects
 */
GtArray *agn_parse_pairwise_loci_disk(const char *refrfile,
                                      const char *predfile,
                                      AgnError *error);

/**
 * Identify loci given an index of annotation features.
 *
 * @param[in]  features    index containing valid annotation features (FIXME)
 * @param[out] error       error object to which warning/error messages will be
 *                         written if necessary
 * @returns                a GtArray containing pointers to GtArrays A_1, A_2,
 *                         ..., A_n (if the index contains n annotated
 *                         sequences), each A_i containing pointers
 *                         to AgnLocus objects
 */
GtArray *agn_parse_loci_memory(GtFeatureIndex *features, AgnError *error);

/**
 * Identify loci given a list of annotation files.
 *
 * @param[in]  numfiles     the number of annotation files
 * @param[in]  filenames    list of filenames corresponding to GFF3 files
 * @param[out] error        error object to which warning/error messages will be
 *                          written if necessary
 * @returns                 a GtArray containing pointers to GtArrays A_1, A_2,
 *                          ..., A_n (if the files contain n annotated
 *                          sequences), each A_i containing pointers
 *                          to AgnLocus objects
 */
GtArray *agn_parse_loci_disk(int numfiles, const char **filenames,
                             AgnError *error);

Concern about AgnLocusIndex delete

The AgnLocusIndex class maintains a hashmap of interval trees, each of which holds gene locus objects. For memory management, it would be much easier to add agn_gene_locus_delete as the interval tree's free function. However, the consequence is that all gene loci would remain in memory until the locus index is freed. Currently, ParsEval deletes loci as it analyzes them to reduce the memory footprint. I'll have to brainstorm a good solution to this.

Intron-containing UTR causes graphics trouble

I was using ParsEval to compare some new Maker annotations to some older annotations, and I found a graphics issue that I think is related to intron-containing UTRs. Most UTRs reported by Maker have no ID attribute, but those that contain introns require multiple lines in the GFF3 file, and therefore Maker uses the ID attribute to indicate which entries/lines correspond to the same UTR.

I'm guessing this is the cause of some strange behavior I'm seeing with the GenomeTools/AnnotationSketch-generated graphics. Here is a ParsEval report whose input included intron-containing UTRs as reported by Maker...

[Screenshot: 2013-05-29, 10:00:47 am]

...and here is the same report for the same input, minus the UTR features (so that they were inferred by ParsEval).

[Screenshot: 2013-05-29, 10:00:22 am]

I need to come up with a few more use cases to confirm that this is the cause of the behavior, then bring this up with the GenomeTools developers and determine whether it's a bug or a feature.

Write unit tests

Several functional tests have been integrated, but AEGeAn needs some unit tests. All of this could eventually be connected to Travis CI for automated testing.

The 'agn_parse_canonical' function too stringent for parsing simple loci

The AgnLocusIndex class uses the agn_parse_canonical function, which does gene validation, to parse loci. This is (perhaps?) appropriate for parsing loci intended for pairwise comparison, but the simple locus parsing methods were written so as to handle any data types correctly, so validation is not required. It might be worth looking into whether this can be improved.

Reduce redundancy in '*_visit_feature_node' functions

In several places, both in stable code (pushed onto GitHub) and unstable code (locally), there are feature node visitor functions that have something of the following form.

  if(gt_feature_node_is_pseudo(fn))
  {
    GtArray *features = gt_array_new( sizeof(GtFeatureNode *) );
    agn_gt_feature_node_resolve_pseudo_node(fn, features);
    while(gt_array_size(features) > 0)
    {
      // CODEBLOCK 1: process child of pseudo node
    }
    gt_array_delete(features);
  }
  else
  {
    // CODEBLOCK 2: process normal feature node
  }

The code in CODEBLOCK 1 and CODEBLOCK 2 is functionally identical. This adds size and complexity to the code base, and makes things like maintenance and debugging more difficult. This should be fixed to remove redundancy and come up with a single approach to handle both cases.
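One way to collapse the two branches is to build a single list of nodes to process (the children of a pseudo node, or the node itself) and run one loop over it. The sketch below uses a mock `Node` type instead of GtFeatureNode, and `visit_node` and `NodeHandler` are hypothetical names, so it shows only the control flow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock stand-in for a feature node; not the GenomeTools type. */
typedef struct Node {
  bool is_pseudo;
  struct Node **children;
  size_t num_children;
  int value;
} Node;

typedef void (*NodeHandler)(Node *node, void *data);

/* Process either the children of a pseudo node or the node itself, using
 * the same handler, so the processing code exists only once. */
void visit_node(Node *node, NodeHandler handle, void *data)
{
  assert(node != NULL && handle != NULL);
  Node *single[1] = { node };
  Node **targets = node->is_pseudo ? node->children : single;
  size_t n = node->is_pseudo ? node->num_children : 1;
  for (size_t i = 0; i < n; i++)
    handle(targets[i], data);
}

/* Example handler: accumulate node values into an int. */
void sum_values(Node *node, void *data)
{
  *(int *)data += node->value;
}
```

The same idea should carry over to the real visitors by moving the shared body into one static function that both cases call.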

Git version number

Currently, the ParsEval HTML output has the AEGeAn version number printed at the bottom of each page. This is the desired behavior for stable releases, but for someone who checks out the code with Git, it would be preferable to print the commit SHA1 instead.

So when compiling the code, the Makefile should check whether there is a git repository. If so, it should grab the SHA1 of the latest commit, and then somehow define/replace the existing one when compiling.
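On the C side, this could be handled with a macro that the Makefile defines only when a commit hash is available, for example by passing the output of `git rev-parse HEAD` through a `-D` compiler flag. The macro names `AGN_GIT_SHA` and `AGN_RELEASE_VERSION`, the placeholder version string, and both functions below are assumptions for the sketch, not existing AEGeAn symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder release string; the real value would come from the project. */
#define AGN_RELEASE_VERSION "x.y.z"

/* Assumed to be supplied by the Makefile when building from a Git checkout;
 * empty otherwise. */
#ifndef AGN_GIT_SHA
#define AGN_GIT_SHA ""
#endif

/* Prefer the commit hash when one is available, else the release version. */
const char *agn_version_label(const char *release, const char *git_sha)
{
  assert(release != NULL);
  if (git_sha != NULL && strlen(git_sha) > 0)
    return git_sha;
  return release;
}

/* Label for this build, suitable for the footer of the HTML output. */
const char *agn_build_version(void)
{
  return agn_version_label(AGN_RELEASE_VERSION, AGN_GIT_SHA);
}
```

Because the fallback lives in the code, a build from a release tarball with no Git metadata needs no special handling in the Makefile.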

Document public API

The public API needs documentation before I can really claim the ability to link against the AEGeAn library. The good: the documentation is already there. The bad: I probably need to write my own script to process it and create something easily readable.

Add Pango to install instructions

When the ParsEval docs were initially written, Pango was not a prerequisite for GenomeTools installation. Now that it is, I should add it to the install documentation.

Pairwise comparison locus parsing incorrect

The agn_parse_loci command is incorrect, in that it treats each unique prediction gene as its own locus, when it's possible it may overlap with other prediction genes. This should be an easy quick fix. Also needs to be applied to AgnLocusIndex class.
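The intended grouping amounts to merging overlapping intervals. The sketch below assumes the genes of one sequence have been pooled into a single array with closed coordinates; `Interval` and `merge_into_loci` are hypothetical names, and the real fix would need to keep track of which genes belong to each locus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Closed interval on one sequence, as in GFF3 coordinates. */
typedef struct { long start, end; } Interval;

/* Order by start, then by end. */
static int cmp_start(const void *a, const void *b)
{
  const Interval *x = a, *y = b;
  if (x->start != y->start)
    return x->start < y->start ? -1 : 1;
  return (x->end > y->end) - (x->end < y->end);
}

/* Sort the genes in place and merge overlapping ones into loci.
 * `loci` must have room for n entries. Returns the number of loci. */
size_t merge_into_loci(Interval *genes, size_t n, Interval *loci)
{
  assert(n == 0 || (genes != NULL && loci != NULL));
  if (n == 0)
    return 0;
  qsort(genes, n, sizeof *genes, cmp_start);
  size_t nloci = 0;
  loci[nloci++] = genes[0];
  for (size_t i = 1; i < n; i++) {
    Interval *cur = &loci[nloci - 1];
    if (genes[i].start <= cur->end) {
      /* Overlaps the current locus: extend it if necessary. */
      if (genes[i].end > cur->end)
        cur->end = genes[i].end;
    } else {
      /* No overlap: this gene starts a new locus. */
      loci[nloci++] = genes[i];
    }
  }
  return nloci;
}
```

Under this scheme a prediction gene that overlaps another prediction gene ends up in the same locus instead of forming its own.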

ID collisions

GeneAnnoLogy needs a better mechanism to prevent errors caused by ID collisions.

OpenMP threads for locus parsing

The AgnLocusIndex methods for locus parsing include hooks for setting the number of OpenMP threads to use during locus parsing, but these are not used. I need to decide the best way to handle this.

Testing for interval locus parsing

I have done some minimal testing for the new interval locus parsing logic, but this needs more attention before it can be classified as stable.

GeneAnnoLogy cat -> show

Show the GFF3 data for the specified annotations
Usage: geneannology show [options] repo
  Options:
    -c|--commit: STRING    show data as they were as of the specified snapshot
                           (as identified by the corresponding SHA1 hash);
                           default is current data
    -h|--help              print this help message and exit
    -r|--range: INT,INT    show data only for the specified range; must also
                           specify a sequence ID
    -s|--seqid: STRING     show data only for the specified sequence

canon-gff3 and locuspocus dependencies

Both the canon-gff3 program and the locuspocus program depend on code that is currently organized in ParsEval classes/modules. This needs to be refactored.

ParsEval: transcript clique pair enumeration

ParsEval will not try to enumerate and compare all of the clique pairs for loci with complex transcript structures yielding an inordinate number of clique pairs. That's the idea at least. It seems in some cases that ParsEval will still hang on a locus for a long time, only to report that there are too many clique pairs to analyze and move on.

The idea behind this check is that potential time sinks would be avoided, so the fact that ParsEval is still hanging on some loci is unfortunate. I need to check whether this is a problem with my implementation, or whether enumerating all the clique pairs is prerequisite to this check.

Data files used by a colleague that caused the hang are at http://gremlin2.soic.indiana.edu/tmp/parseval-sowmya/, specifically the locus Group6.36[339737, 491606].
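Whether the check can avoid a full enumeration depends on how cliques are found. As one possible approach, and not a description of the current ParsEval implementation, the sketch below counts maximal cliques with a basic Bron-Kerbosch recursion that aborts as soon as a cap is exceeded. The bitmask graph representation (at most 64 transcripts, with `adj[v]` marking the transcripts compatible with `v`) and all names are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* State shared across the recursion. */
typedef struct {
  const uint64_t *adj;   /* adj[v]: bitmask of vertices adjacent to v */
  size_t limit;          /* abort once more than this many cliques exist */
  size_t count;          /* maximal cliques found so far */
} CliqueCounter;

/* Bron-Kerbosch without pivoting. p holds the candidate vertices, x the
 * already-explored ones. Returns false as soon as the cap is exceeded. */
static bool count_from(CliqueCounter *cc, uint64_t p, uint64_t x)
{
  if (p == 0 && x == 0) {
    cc->count++;
    return cc->count <= cc->limit;
  }
  while (p != 0) {
    unsigned v = 0;
    while (((p >> v) & 1u) == 0)
      v++;                                  /* lowest candidate vertex */
    uint64_t bit = (uint64_t)1 << v;
    if (!count_from(cc, p & cc->adj[v], x & cc->adj[v]))
      return false;                         /* propagate the early abort */
    p &= ~bit;
    x |= bit;
  }
  return true;
}

/* Count the maximal cliques among n <= 64 vertices, stopping early once
 * the count exceeds limit. Returns true if the count stayed within it. */
bool count_cliques_capped(const uint64_t *adj, size_t n, size_t limit,
                          size_t *count)
{
  assert(n <= 64 && count != NULL);
  CliqueCounter cc = { adj, limit, 0 };
  bool within = true;
  if (n > 0) {
    uint64_t all = (n == 64) ? ~(uint64_t)0 : (((uint64_t)1 << n) - 1);
    within = count_from(&cc, all, 0);
  }
  *count = cc.count;
  return within;
}
```

With a capped count like this, the work done on a pathological locus is bounded by the cap rather than by the total number of cliques, although the recursion can still spend time on branches that yield no maximal clique.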
