brendelgroup / xgdbvm
Code for the xGDBvm iPlant Atmosphere enabled genome annotation platform
Home Page: http://brendelgroup.github.io/xGDBvm/
License: GNU General Public License v3.0
The current setup does not use the full power of the local (VM) resources. In particular, GeneSeqer is always run in non-parallel mode, even on a VM with multiple processors.
Performance gains should be substantial if the user is given an additional parameter when setting the Transcript Spliced Alignment options: "Number of processors" (default 1). If it is set to a number >1, the script splitMakearryGSQ.pl (which could be renamed in the process!) could invoke GeneSeqerMPI instead of GeneSeqer.
Suggestion: assign parameter passing to Jon, GeneSeqerMPI implementation to Volker, independent testing to Daniel
Priority: not urgent, but attractive
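The dispatch described above could look roughly like the following. This is a minimal sketch in Python, not the existing Perl script: the function name `build_gsq_command` is hypothetical, the GeneSeqer arguments are passed through untouched, and launching GeneSeqerMPI via `mpirun -np` is an assumption about how the MPI build would be started on the VM.

```python
def build_gsq_command(gsq_args, processors=1):
    """Return the command line (as a list) for one spliced-alignment run.

    gsq_args   -- the existing GeneSeqer arguments, passed through unchanged
    processors -- the proposed "Number of processors" option (default 1)
    """
    if processors < 1:
        raise ValueError("processors must be >= 1")
    if processors == 1:
        # Current behaviour: serial GeneSeqer.
        return ["GeneSeqer", *gsq_args]
    # Assumption: GeneSeqerMPI is on the PATH and is launched through mpirun.
    return ["mpirun", "-np", str(processors), "GeneSeqerMPI", *gsq_args]


if __name__ == "__main__":
    # "ARG1"/"ARG2" stand in for whatever options the script already passes.
    print(build_gsq_command(["ARG1", "ARG2"], processors=4))
```

Keeping the default at 1 means existing configurations would behave exactly as they do now.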
The code quality / readability varies quite a bit throughout the xGDBvm codebase: we have everything from neatly organized code to quite-messy-held-together-by-duct-tape code and everything in between. In some cases, the indentation (or lack thereof) of the code makes it extremely difficult for the reader to understand.
Of course, we don't have the time or resources to go through and fix this comprehensively, but it may be a good idea, whenever fixing bugs or adding new features, to clean up the relevant code if needed. It may be a single code block, an entire function, or an entire file, depending on the scope of the bugfix/feature and the amount of time available.
Coding style involves a lot of stylistic/cosmetic decisions about brace placement, presence of whitespace around parentheses and commas, and so on. Most of these decisions are trivial in my opinion and have little effect on readability of code.
The one thing that DOES have a HUGE effect on readability is indentation. Inconsistent indentation doesn't just make the code hard to read, it can even suggest logical structure in the code that isn't really there, which takes extra work to comprehend.
I think a few simple guidelines would go a LONG way to improving the code. Again, this isn't something we need to implement with sweeping changes, but piecemeal as we are able.
Dear Team,
I am trying to install xGDBvm on my server.
The tutorial for installing the server does not open: http://goblinx.soic.indiana.edu/wiki/doku.php?id=tutorials
Kindly rectify the issue.
Thank you
sridhar
There is currently an issue with job status reporting via the Agave API that can cause the xGDBvm workflow to stop looking for output data from a remote job. This happens when a false FAILED status is returned to xGDBvm's webhook.php script during a job that has not actually failed. In a pipeline-integrated job, this status, when passed back to xGDB_Procedure.sh, exits a loop that would normally monitor for outputs in the user's job archive. In a standalone job, it simply results in a false outcome displayed in the jobs.php listing. This false FAILED status tends to happen to my jobs after SUBMITTING and before QUEUED status. Identical submissions may or may not falsely FAIL. It's been reported to the Agave team, but in the meantime xGDBvm may report a job failure that is false.
The current end-user workaround is to check for output from a FAILED run at /iplant/home/[username]/archive/jobs/[job-id]/ and, if present, manually copy it to the input directory if further processing with a genome annotation is desired. If this Agave bug is not easily fixable, a technical workaround would be to modify xGDB_Procedure.sh so that it reacts to a FAILED message only if no additional status messages follow within a specified amount of time. But hopefully this is a short-lived issue...!
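The "wait before trusting FAILED" workaround can be sketched as follows. This is only an illustration in Python of the decision rule, not a patch to xGDB_Procedure.sh; the function name `failure_is_confirmed`, the event list format, and the grace period are all assumptions made for the example.

```python
def failure_is_confirmed(events, now, grace_seconds):
    """Decide whether a FAILED status should be acted on.

    events        -- list of (timestamp_seconds, status) in arrival order
    now           -- current time in the same units as the timestamps
    grace_seconds -- how long to wait for a follow-up status message

    A FAILED status counts only if it is the most recent message and
    nothing else has arrived for at least grace_seconds.
    """
    if not events:
        return False
    last_time, last_status = events[-1]
    return last_status == "FAILED" and (now - last_time) >= grace_seconds


if __name__ == "__main__":
    # A false FAILED between SUBMITTING and QUEUED is ignored.
    history = [(0, "SUBMITTING"), (5, "FAILED"), (8, "QUEUED")]
    print(failure_is_confirmed(history, now=100, grace_seconds=60))
```

With this rule the monitoring loop would keep polling the job archive until the grace period has elapsed without any newer status.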
I get this message when trying to install the GeneMark key.
failed to copy /xGDBvm/input/xgdbvm//keys/gm_key to /usr/local/src/GENEMARK/genemark_hmm_euk.linux_64/.gm_key...
I don't know if this is a known issue.
I just took a moment to spin up a new VM based on the new v1.13 image. I followed the quickstart instructions, which still include the update command, which attempts to do a Subversion update. Since we've moved to git now, I edited the update script and was about to commit my changes when I noticed this had already been done and saved to a new update-vm script, which does the appropriate git update from GitHub.
I suggest we overwrite the update script with the new update-vm script and lose the second copy. There's really no reason I can think of for maintaining multiple copies.
Having the xGDBvm repo under root ownership on the iPlant images makes for an awkward development environment: file management and git configuration must be done as the root user instead of the regular user.
I'm not sure what this accomplishes. It doesn't offer any "protection" from iPlant users: most will ignore the code anyway, and there is little to stop those that are curious from tinkering since they have root permissions. But perhaps there are some legitimate security concerns or other practical considerations that this permissions scheme addresses.
I brought this up recently in email, but figured I'd start a thread here so that the conversation doesn't slip through the cracks again.
I don't really consider this a major issue that should affect any of our timelines; I just want to better understand why we're doing what we're doing.
Background
In xGDBvm's annotation workflow, user-provided ~annot.gff3 file(s) are combined, parsed and loaded as 'Precomputed Gene Models', identified by a unique geneId, for display in the genome browser as an annotation track. xGDBvm also expects to load two associated sequence files, derived from the GFF3 file data: ~annot.mrna.fa (transcripts) and ~annot.pep.fa (translations). The purpose of including these files is to allow xGDBvm users to download or query (via Blast) annotation sequences on a batchwise or single-sequence basis. For example, clicking on a gene model in the 'Genome Context Mode' of the xGDBvm genome browser brings up a 'Sequence Record' (via getRecord.pl), which displays summary information about that gene model, mostly from the parsed GFF3 table, but also (if available) a CDS translation from the indexed FASTA file (see screenshot) and a link to download the sequence (using returnFASTA.pl).
This functionality in getRecord.pl depends on a unique database value, geneId or transcript_id, found in the GFF3-parsed table gseg_gene_annotation or cpgat_gene_annotation, that is matched by a FASTA identifier in the associated sequence file. The FASTA files are found under /xGDBvm/data/GDBnnn/data/BLAST/, and the requisite queries, paths, and hypertext additions are set by the DSO module SequenceTrack.pm.
The issue
Unfortunately, there is no guarantee that the geneId and/or transcript_id parsed from GFF3 will match the FASTA identifiers provided (although xGDBvm instructions caution users to make sure there is a match). Mismatches are common because a GFF3 table may contain one or more unique identifiers, displayed variously as e.g. 'ID=', 'geneID=', 'Name=', 'transcript_id=', etc. If more than one identifier is present, the parsing script is programmed to choose from among these identifiers hierarchically, and it can't know which one is appropriate for matching a FASTA record.
So, short of requiring users to munge their data ahead of time to ensure an ID match, we need some way to increase the probability that user-uploaded precomputed annotations will include the above-described ID match.
Possible solutions
xGDBvm already provides a sequence validation script (from validate_files.php and xGDB_ValidateFiles.sh) and encourages users to run it before initiating their annotation workflow. It includes a rudimentary QC step that compares the number of 'transcript' records in the GFF3 file vs the number of associated FASTA records, and sets a warning flag if the two are not equal.
So along these lines, one possible solution would be to extend the validation process to include an analysis of available ~annot.gff3, ~annot.mrna.fa and ~annot.pep.fa files, specifically to parse available transcript / translation ID types from each (as name:value pairs). Examples of these could then be displayed to the user on the configuration page, allowing the user to select the correct ID type by clicking the appropriate radio button.
Other solutions could also be explored.
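As a rough illustration of the validation extension proposed above, the sketch below (Python, not the existing PHP/shell validation code) collects every attribute key found on transcript features in a GFF3 file and counts how many of its values appear as FASTA identifiers. The function names, the choice of `mRNA`/`transcript` as feature types, and the use of the first whitespace-delimited token as the FASTA identifier are assumptions made for the example.

```python
from collections import defaultdict


def gff3_ids_by_key(gff3_lines, feature_types=("mRNA", "transcript")):
    """Map each column-9 attribute key to the set of values seen on transcripts."""
    ids = defaultdict(set)
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in feature_types:
            continue
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                ids[key.strip()].add(value.strip())
    return ids


def fasta_ids(fasta_lines):
    """Return the first token of every non-empty FASTA header line."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def rank_id_keys(gff3_lines, fasta_lines):
    """Rank GFF3 attribute keys by how many FASTA identifiers they match."""
    wanted = fasta_ids(fasta_lines)
    scores = {k: len(v & wanted) for k, v in gff3_ids_by_key(gff3_lines).items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    gff3 = ["chr1\t.\tmRNA\t1\t100\t.\t+\t.\tID=t1;Name=geneA.1;Parent=g1\n"]
    fasta = [">geneA.1 hypothetical protein\n", "MKV\n"]
    print(rank_id_keys(gff3, fasta))
```

The ranked list could feed the radio-button choice on the configuration page, with the best-matching key preselected and the user free to override it.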
A dataset with Volvox carteri genome scaffolds, containing N-spacers (3.7% of total according to xGDBvm's file validation script), was found to be incorrectly parsed for what we term 'N-masked' regions.
Specifically, no N-masked regions were parsed by the script parseGsegMask.pl, resulting in an empty ~mask.fa file and a WARNING flag 6.40 in the Pipeline_Procedure.log. I can reproduce the problem but haven't found the source yet. One difference from the Example 1 benchmark for N-mask parsing is that V. carteri genome segments include lower-case (gatc) bases, although the interspersed N-masked sequence is uppercase.
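For comparison while debugging, the expected result on a mixed-case sequence can be produced with a few lines of Python. This is an independent sketch, not the logic of parseGsegMask.pl; the function name `n_masked_regions`, the 1-based inclusive coordinates, and the `min_length` parameter are assumptions for the example.

```python
import re


def n_masked_regions(sequence, min_length=1):
    """Return 1-based inclusive (start, end) pairs for each run of N/n.

    Other bases may be upper or lower case; only runs of N are reported.
    """
    return [
        (m.start() + 1, m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_length
    ]


if __name__ == "__main__":
    # Lower-case bases with upper-case N spacers, as in the reported dataset.
    print(n_masked_regions("acgtNNNNacgtNNgg"))  # [(5, 8), (13, 14)]
```

Running such a check on one affected scaffold would show whether the N runs are detectable at all, which helps narrow the problem to the parsing script rather than the input.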