xgdbvm's People

Contributors

jduvick, standage, vpbrendel


xgdbvm's Issues

use of local CPU resources

The current setup does not use the full power of the local (VM) resources. In particular, GeneSeqer is always run in non-parallel mode, even on a VM with multiple processors.

Substantial performance gains should be possible by giving the user an additional parameter when setting the Transcript Spliced Alignment options: "Number of processors". If this is set to a number >1 (the default can be 1), then the script splitMakearryGSQ.pl (which could be renamed in the process!) could invoke GeneSeqerMPI instead of GeneSeqer.
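A minimal sketch of that dispatch, assuming a "number of processors" parameter and a conventional mpirun launcher (neither is confirmed to be the actual xGDBvm interface):

```shell
#!/bin/sh
# Hypothetical sketch: choose the aligner invocation based on a
# "Number of processors" parameter. The function only builds the
# command string; the parameter handling and mpirun syntax are
# assumptions, not the actual splitMakearryGSQ.pl logic.
gsq_cmd() {
    nprocs="${1:-1}"    # default: 1 (current non-parallel behavior)
    if [ "$nprocs" -gt 1 ]; then
        echo "mpirun -np $nprocs GeneSeqerMPI"
    else
        echo "GeneSeqer"
    fi
}
```

splitMakearryGSQ.pl would then prepend the chosen command to the alignment jobs it generates.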

Suggestion: assign parameter passing to Jon, GeneSeqerMPI implementation to Volker, independent testing to Daniel

Priority: not urgent, but attractive

Code indentation

The code quality / readability varies quite a bit throughout the xGDBvm codebase: we have everything from neatly organized code to quite-messy-held-together-by-duct-tape code and everything in between. In some cases, the indentation (or lack thereof) of the code makes it extremely difficult for the reader to understand.

Of course, we don't have the time or resources to go through and fix this comprehensively, but whenever we fix a bug or add a new feature it may be a good idea to clean up the relevant code as needed. That may mean a single code block, an entire function, or even an entire file, depending on the scope of the bugfix/feature and the amount of time available.

Coding style involves a lot of stylistic/cosmetic decisions about brace placement, presence of whitespace around parentheses and commas, and so on. Most of these decisions are trivial in my opinion and have little effect on readability of code.

The one thing that DOES have a HUGE effect on readability is indentation. Inconsistent indentation doesn't just make the code hard to read, it can even suggest logical structure in the code that isn't really there, which takes extra work to comprehend.

I think a few simple guidelines would go a LONG way to improving the code. Again, this isn't something we need to implement with sweeping changes, but piecemeal as we are able.

  • Indentation should reflect logical flow/organization of the code. All of the following introduce logical structure into the program which should be reflected by indentation.
    • function definitions
    • conditionals (if statements)
    • loops (for, while, do)
  • All commands within a particular code block should have identical indentation.
  • Four spaces are the preferred unit of indentation.
    • Two spaces or a tab character are also common conventions.
    • It's less important which unit of indentation is used than it is to simply be consistent!
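A short shell sketch of the guidelines above (the same rules apply to the Perl and PHP code in the repo); each construct that introduces logical structure adds one four-space level, and every command in a block lines up at the same level:

```shell
#!/bin/sh
# Each construct that introduces logical structure (function, loop,
# conditional) adds one four-space indent level; all commands within
# the same block share identical indentation.
count_fasta_records() {
    total=0
    for f in "$@"; do                  # loop body: one level in
        if [ -r "$f" ]; then           # conditional body: two levels in
            n=$(grep -c '^>' "$f")
            total=$((total + n))
        fi
    done
    echo "$total"
}
```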

xGDBvm scripts can be fooled by false FAILED message from Agave

There is currently an issue with job status reporting via the Agave API that can cause the xGDBvm workflow to stop looking for output data from a remote job. This happens when a false FAILED status is returned to xGDBvm's webhook.php script during a job that has not actually failed. In a pipeline-integrated job, passing this status back to xGDB_Procedure.sh causes it to exit the loop that would normally monitor for outputs in the user's job archive. For a standalone job, it simply results in a false outcome displayed in the jobs.php listing.

This 'false' FAILED status tends to appear on my jobs after SUBMITTING and before QUEUED status; identical submissions may or may not 'falsely' FAIL. It has been reported to the Agave team, but in the meantime xGDBvm may report a job failure that is false.

The current end-user workaround is to check for output from a FAILED run at /iplant/home/[username]/archive/jobs/[job-id]/ and, if output is present, manually copy it to the input directory if further processing with a genome annotation is desired. If this Agave bug is not easily fixable, a technical workaround would be to modify xGDB_Procedure.sh to not react to a FAILED message unless no additional status messages follow it within a specified amount of time. But hopefully this is a short-lived issue...!
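A rough sketch of that debounce idea, assuming the most recent Agave status is available as the last line of a status file (the file layout, grace period, and polling approach are all illustrative, not xGDB_Procedure.sh's actual logic):

```shell
#!/bin/sh
# Hypothetical debounce: accept a FAILED status as final only if no
# newer status message has superseded it after a grace period.
is_final_failure() {
    status_file="$1"          # last line = most recent Agave status
    grace_secs="${2:-300}"    # how long to wait before trusting FAILED
    [ "$(tail -n 1 "$status_file")" = "FAILED" ] || return 1
    sleep "$grace_secs"
    # still the latest status after the wait => treat as a real failure
    [ "$(tail -n 1 "$status_file")" = "FAILED" ]
}
```

A later QUEUED or RUNNING message appended to the file during the grace period would then override the spurious FAILED.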

GeneMark key doesn't copy correctly.

I get this message when trying to install the GeneMark key.

failed to copy /xGDBvm/input/xgdbvm//keys/gm_key to /usr/local/src/GENEMARK/genemark_hmm_euk.linux_64/.gm_key... 

I don't know if this is a known issue.

Streamline update process

I just took a moment to spin up a new VM based on the new v1.13 image. I followed the quickstart instructions, which still include the update command that attempts a Subversion update. Since we've moved to git now, I edited the update script and was about to commit my changes when I noticed this had already been done and saved to a new update-vm script, which performs the appropriate git update from GitHub.

I suggest we overwrite the update script with the new update-vm script and lose the second copy. There's really no reason I can think of for maintaining multiple copies.

Root ownership of the code on iPlant/CyVerse images

Having the xGDBvm repo under root ownership on the iPlant images makes for an awkward development environment: file management and git configuration must be done as the root user instead of the regular user.

I'm not sure what this accomplishes. It doesn't offer any "protection" from iPlant users: most will ignore the code anyway, and there is little to stop those who are curious from tinkering, since they have root permissions. But perhaps there are legitimate security concerns or other practical considerations that this permissions scheme addresses.

I brought this up recently in email, but figured I'd start a thread here so that the conversation doesn't slip through the cracks again.

And I don't really consider this a major issue that should affect any of our timelines, I just want to better understand why we're doing what we're doing.

'Smart' parsing of input GFF3 files so that IDs match the annotation FASTA files.

Background
In xGDBvm's annotation workflow, user-provided ~annot.gff3 file(s) are combined, parsed and loaded as 'Precomputed Gene Models', identified by a unique geneId, for display in the genome browser as an annotation track. xGDBvm also expects to load two associated sequence files, derived from the GFF3 file data: ~annot.mrna.fa (transcripts) and ~annot.pep.fa (translations). The purpose of including these files is to allow xGDBvm users to download or query (via Blast) annotation sequences on a batchwise or single-sequence basis. For example, clicking on a gene model in the 'Genome Context Mode' of the xGDBvm genome browser brings up a 'Sequence Record' (via getRecord.pl), which displays summary information about that gene model, mostly from the parsed GFF3 table, but also (if available) a CDS translation from the indexed FASTA file (see screenshot) and a link to download the sequence (using returnFASTA.pl).

[screenshot: Sequence Record display]

This functionality in 'getRecord.pl' depends on a unique database value geneId or transcript_id found in the GFF3-parsed table gseg_gene_annotation or cpgat_gene_annotation, that is matched by a FASTA identifier in the associated sequence file. The FASTA files are found under /xGDBvm/data/GDBnnn/data/BLAST/, and the requisite queries, paths, and hypertext additions are set by the DSO module SequenceTrack.pm.

The issue
Unfortunately, there is no guarantee that the geneId and/or transcript_id parsed from the GFF3 will match the FASTA identifiers provided (although the xGDBvm instructions caution users to make sure there is a match). This is often the case because the GFF3 file may contain one or more unique identifiers, displayed variously as e.g. 'ID=', 'geneID=', 'Name=', 'transcript_id=', etc. If more than one identifier is present, the parsing script chooses from among these identifiers hierarchically, and it cannot know which one is appropriate for matching a FASTA record.

So, short of requiring users to munge their data ahead of time to ensure an ID match, we need some way to increase the probability that user-uploaded precomputed annotations will include the above-described ID match.

Possible solutions
xGDBvm already provides a sequence validation script (from validate_files.php and xGDB_ValidateFiles.sh) and encourages users to run it before initiating their annotation workflow. It includes a rudimentary QC step that compares the number of 'transcript' records in the GFF3 file vs the number of associated FASTA records, and sets a warning flag if the two are not equal.

So along these lines, one possible solution would be to extend the validation process to include an analysis of the available ~annot.gff3, ~annot.mrna.fa and ~annot.pep.fa files, specifically parsing the available transcript/translation ID types from each (as name:value pairs). Examples of these could then be displayed on the configuration page, allowing the user to select the correct ID type by clicking the appropriate radio button.
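As a sketch of what that ID-type analysis could look like on the GFF3 side (the attribute splitting here is a simplification of the GFF3 spec, and the function name is invented for illustration):

```shell
#!/bin/sh
# Hypothetical sketch: list the identifier attribute types (ID, Name,
# transcript_id, ...) present in column 9 of a GFF3 file, so the user
# could pick the type that matches the FASTA headers.
gff3_id_types() {
    grep -v '^#' "$1" |          # skip pragma/comment lines
        cut -f 9 |               # attributes column
        tr ';' '\n' |            # one key=value pair per line
        sed -n 's/^ *\([A-Za-z_][A-Za-z_]*\)=.*/\1/p' |
        sort -u
}
```

The same kind of scan over the FASTA defline prefixes would give the other half of the name:value pairs to present to the user.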

Other solutions could also be explored.

N-masked sequence may be ignored, no track feature created

A dataset of Volvox carteri genome scaffolds containing N-spacers (3.7% of the total, according to xGDBvm's file validation script) was found to be incorrectly parsed for what we term 'N-masked' regions.

Specifically, no N-masked regions were parsed by the script parseGsegMask.pl, resulting in an empty ~mask.fa file and a WARNING flag 6.40 in the Pipeline_Procedure.log. I can reproduce the problem but haven't found the source yet. One difference from the Example 1 benchmark for N-mask parsing is that the V. carteri genome segments include lower-case (gatc) bases, although the interspersed N-masked sequence is upper-case.
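For comparison, a case-insensitive N-run scan can be sketched like this (the coordinate output and awk approach are illustrative only; parseGsegMask.pl's real logic differs):

```shell
#!/bin/sh
# Hypothetical sketch: report 1-based start/end coordinates of each run
# of N (or n) in a sequence string, normalizing case before matching --
# the step that appears to be missing for lower-case genome segments.
n_runs() {
    echo "$1" | awk '{
        seq = toupper($0)              # tolerate gatcn as well as GATCN
        pos = 1
        while (match(substr(seq, pos), /N+/)) {
            start = pos + RSTART - 1
            print start, start + RLENGTH - 1
            pos = start + RLENGTH
        }
    }'
}
```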
