The reference gene format used by the GSL compiler is a legacy format. It was originally derived from an SGD export, and even that was obscure. We further complicated things by subtracting one from every coordinate, making it zero-based, which is unusual in user-facing biology systems.
One proposal is to replace this legacy format with the GFF3 standard, which is a relatively common format in biology. It suffers from some standardisation issues, but it should allow rich expression of gene structure information with coordinates, be more interoperable with other bioinformatics tools, and optionally allow combining the FASTA sequence and the coordinates into a single file.
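To make the coordinate shift concrete, here is a rough sketch of turning one legacy record into a GFF3 line. GFF3 coordinates are one-based and inclusive, so the subtracted one has to be added back. The `LegacyGene` fields and the `legacy_ref` source label are hypothetical; the real ref file columns may well differ, and this assumes both start and end are stored zero-based.

```python
from dataclasses import dataclass


@dataclass
class LegacyGene:
    # Hypothetical in-memory view of one legacy record (not the real column layout).
    name: str
    chrom: str
    start0: int    # zero-based, inclusive (assumption)
    end0: int      # zero-based, inclusive (assumption)
    forward: bool  # True for the forward strand


def to_gff3_line(gene: LegacyGene, source: str = "legacy_ref") -> str:
    """Render one gene as a tab-separated, nine-column GFF3 line."""
    # GFF3 requires start <= end regardless of strand, so order the pair first.
    low, high = sorted((gene.start0, gene.end0))
    columns = [
        gene.chrom,
        source,
        "gene",
        str(low + 1),   # back to one-based
        str(high + 1),  # back to one-based
        ".",            # score: not available
        "+" if gene.forward else "-",
        ".",            # phase: only meaningful for CDS features
        f"ID={gene.name};Name={gene.name}",
    ]
    return "\t".join(columns)
```

Names containing `;`, `=`, `&`, `,` or tabs would need percent-encoding before going into the attributes column, which this sketch does not do.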
In order to implement these changes we would need to provide:
- a GFF3 parser that can replace the current ref format loader
- a tool to convert existing ref files into the GFF3 format
- optionally, a tool for validating GFF3 files, since they can be non-standard, especially with respect to where the gene identifiers are stored
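As a rough idea of what the validation tool might check, the sketch below flags lines that do not have nine columns, have bad coordinates or strand values, or are `gene` features without an `ID` attribute. The function names and the exact set of checks are assumptions for illustration, not an agreed design.

```python
from urllib.parse import unquote


def parse_attributes(column9: str) -> dict:
    """Split the GFF3 attributes column (key=value pairs joined by ';')."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key.strip()] = unquote(value.strip())
    return attributes


def validate_gff3(lines):
    """Return a list of (line_number, message) problems found in the annotation."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.strip() == "##FASTA":
            break  # everything after this directive is sequence, not annotation
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            problems.append((number, f"expected 9 columns, found {len(columns)}"))
            continue
        try:
            start, end = int(columns[3]), int(columns[4])
        except ValueError:
            problems.append((number, "start and end must be integers"))
            continue
        if start < 1 or end < start:
            problems.append((number, "coordinates must satisfy 1 <= start <= end"))
        if columns[6] not in {"+", "-", ".", "?"}:
            problems.append((number, f"invalid strand {columns[6]!r}"))
        # The check that matters most here: where is the gene identifier stored?
        if columns[2] == "gene" and "ID" not in parse_attributes(columns[8]):
            problems.append((number, "gene feature has no ID attribute"))
    return problems
```

A file that stores the identifier under some other key, such as `gene_id=YFG1`, would be reported rather than silently loaded with unnamed genes.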
It might also be desirable to make the format loader a configurable option, to enable future formats. It's questionable whether we should retain support for the existing reference files or just forget about them as a bad memory and encourage migration to GFF3 ;)
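One way a configurable loader could look is a small registry keyed by format name, so a new format only needs to register a function. This is sketched in Python for brevity rather than in the compiler's own language, and `register_loader`, `load_reference` and the dictionary shape are all hypothetical names, not existing GSL code.

```python
from typing import Callable, Dict, List

Loader = Callable[[str], List[dict]]
LOADERS: Dict[str, Loader] = {}


def register_loader(fmt: str):
    """Decorator that makes a loader selectable by its format name."""
    def decorator(fn: Loader) -> Loader:
        LOADERS[fmt] = fn
        return fn
    return decorator


@register_loader("gff3")
def load_gff3(text: str) -> List[dict]:
    """Collect the gene features from GFF3 text (very small subset of the format)."""
    genes = []
    for line in text.splitlines():
        if line.strip() == "##FASTA":
            break
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) == 9 and columns[2] == "gene":
            genes.append({
                "chrom": columns[0],
                "start": int(columns[3]),
                "end": int(columns[4]),
                "strand": columns[6],
                "attributes": columns[8],
            })
    return genes


def load_reference(fmt: str, text: str) -> List[dict]:
    """Dispatch to whichever loader the configuration names."""
    try:
        loader = LOADERS[fmt]
    except KeyError:
        raise ValueError(f"no loader registered for format {fmt!r}") from None
    return loader(text)
```

If we did keep the existing reference files, the legacy loader would just be one more registered entry, and dropping it later would be a one-line change.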
I know there is an existing F# implementation of a GFF3 parser, and if that were released it would save some effort. I have code for generating GFF3 files and could quickly write the conversion and validation tools.
Do we wish to combine the DNA sequence and coordinates into a single file? The advantages are that there is just one file floating around with the whole genome, and possibly slightly faster loading. It also ensures that the coordinates and reference sequence stay together. The disadvantage is that it's harder to get a copy of just the FASTA file for other analyses, although a conversion tool for that would also be possible.
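GFF3 marks the start of embedded sequence with a `##FASTA` directive, so pulling the FASTA back out of a combined file is cheap. A minimal sketch of such a conversion tool follows; `split_gff3` is a hypothetical name, and it only handles the explicit `##FASTA` directive.

```python
def split_gff3(text: str):
    """Split combined GFF3 text into (annotation, fasta) at the ##FASTA directive."""
    annotation, fasta = [], []
    in_fasta = False
    for line in text.splitlines():
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True  # drop the directive itself; the rest is sequence
            continue
        (fasta if in_fasta else annotation).append(line)
    return "\n".join(annotation), "\n".join(fasta)
```

For a file with no `##FASTA` directive the second element comes back as an empty string, which is the case where the sequence lives in a separate file.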
In terms of interface with the GSL compiler, we could initially create a loader that plugs into the existing Feature data structure, so the majority of the compiler would be untouched by this upgrade. It would be desirable to expand the data structure to capture things like intron/exon coordinates (note we have largely lost these from existing ref files although in theory they could be there). This would enable more intelligent processing of things like open reading frames in the future, but that's a bigger change to the core compiler.
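To give a feel for the expanded data structure, here is a sketch of a feature record that carries optional exon coordinates and derives introns from them. It is written in Python for brevity, and the field names are hypothetical; they are not the fields of the compiler's existing Feature type.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    # Hypothetical field names; coordinates are inclusive and share one convention.
    name: str
    chrom: str
    start: int
    end: int
    forward: bool
    exons: List[Tuple[int, int]] = field(default_factory=list)  # empty if unknown

    def introns(self) -> List[Tuple[int, int]]:
        """Gaps between consecutive exons, in ascending coordinate order."""
        ordered = sorted(self.exons)
        return [
            (left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
            if right_start - left_end > 1
        ]

    def spliced_length(self) -> int:
        """Total exon length, falling back to the whole span when exons are unknown."""
        if not self.exons:
            return self.end - self.start + 1
        return sum(e_end - e_start + 1 for e_start, e_end in self.exons)
```

Because `exons` defaults to an empty list, a loader that only knows the gene span can still populate the record, which keeps the first step (plugging into the existing structure) separate from the later ORF-aware work.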