amyris / gslcore

Core library and basic plug-ins for the Amyris Genotype Specification Language (GSL) compiler.

License: Apache License 2.0

Batchfile 0.01% F# 98.20% Shell 0.01% HTML 1.79%

gsl genotype-specification-language synthetic-biology biology bioinformatics compiler genetics

gslcore's Introduction

Genotype Specification Language (GSL) Core library

Amyris domain specification language for rapidly specifying genetic designs

Community

Discussion of the language, its development, and its usage takes place in this Google group.

Documentation

Documentation in the repo is currently sparse, but you can find:

the scientific paper describing the language, Genotype Specification Language
a Springer book chapter, Engineering Genomes with Genotype Specification Language
GSL documentation as part of the Autodesk Genetic Constructor tool
the press release on the Amyris GSL / Autodesk collaboration

This library provides all core modules of the compiler, template code for constructing an application, and a set of plug-ins providing basic core functionality.

NB: the actual compiler front end lives in a separate repository (Gslc) and can be built from a distributed packaged version of GslCore without building this repository.

gslcore's People

demetrixbio jamessdixon dmitry-a-morozov forki legezam curseoff chrismacklin yzupnick

gslcore's Issues

Merge Gslc into GslCore

Hey @daz10000,
I'm starting the work of releasing our integration tests, and from what I can see they would currently need to live in the Gslc repo (https://github.com/Amyris/Gslc). I know each of our companies probably has an internal version of that repo that does its own peculiar and proprietary things. That said, as illustrated in #1, coming up to speed with GSL as an outsider is a real tough lift, because the relationship between the repos is not clearly defined and there are no good docs or examples on how to get set up. My perhaps naive proposal is to move the https://github.com/Amyris/Gslc/tree/master/src/Gslc dotnet project into https://github.com/Amyris/GslCore/tree/master/src. Chris Dolan did some work internally to port that project from the old mono setup to dotnetcore. After that, we could push our integration tests into a test directory in the root of GslCore, and they could be run on every change to GslCore.

Do you think this is a good idea? Happy to entertain alternatives if you have better ideas!

BUG: variable self reference causes stack overflow

This (admittedly buggy) GSL causes a stack overflow on our production compiler. Please verify, of course, but I'm pretty sure it's a bug in GslCore.


let x = oGAL1

let y = &y ; &x
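One common way to turn this kind of crash into a readable error is to track which variables are currently being expanded and to fail as soon as one of them is seen again. The sketch below is only an illustration in Python, not GslCore's actual resolver: the `bindings` dictionary, the `&`-prefixed token encoding, and the `resolve` function are all hypothetical.

```python
class CircularReferenceError(Exception):
    """Raised when a variable refers to itself, directly or indirectly."""


def resolve(name, bindings, in_progress=()):
    """Expand `name` into a flat list of parts, following &variable references.

    `bindings` maps a variable name to a list of tokens; a token starting
    with '&' is a reference to another variable (hypothetical encoding).
    """
    if name in in_progress:
        # The variable is already being expanded further up the call chain.
        chain = " -> ".join(in_progress + (name,))
        raise CircularReferenceError(f"circular variable reference: {chain}")
    parts = []
    for token in bindings[name]:
        if token.startswith("&"):
            parts.extend(resolve(token[1:], bindings, in_progress + (name,)))
        else:
            parts.append(token)
    return parts


# The two bindings from the report: x is fine, y refers to itself.
bindings = {"x": ["oGAL1"], "y": ["&y", "&x"]}
```

With these bindings, `resolve("x", bindings)` returns the single part, while `resolve("y", bindings)` raises `CircularReferenceError` instead of recursing until the stack is exhausted.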

How to get started building GslCore, relationship to Gslc

Hi!

I'm taking over the Autodesk integration of GSL into the Genetic Constructor. I'm a little confused on the relationship between https://github.com/AmyrisInc/GslCore/ and https://github.com/AmyrisInc/Gslc. I can build the Gslc locally, but not the GslCore. The GslCore appears to contain features needed for primer generation.

Dion

FEA: Allow annotation of DNA topology

Intro

GSL designs are inherently linear in specification. Many of the actual pieces of DNA that scientists construct are circular and contain additional implicit DNA sequences in the final structure. The actual process of circularisation usually arises during construction (i.e. downstream of the design), but it is inherently a part of the DNA design, and it would be ideal to capture that intention.

Proposal

Add a global persistent pragma that specifies the DNA topology in a way that would be associated with any ongoing designs. Associate the topology (linear or circular) with the assembly. Update any output format generators to make any appropriate adjustments in format (most DNA structure file formats have some way of representing linear or circular) to reflect the desired topology.

E.g.


// This would be linear DNA
uFOO4; oERG9 ; dFOO4

#topology circular

// these would be emitted as circular designs e.g. plasmids
gTDH3; oERG6 ; tTDH3
gTDH1; oERG6 ; tTDH3

#topology linear
// this would be emitted as a linear design
gTDH2 ; oERG6 ; rERG7
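The "persistent pragma" behaviour in the example can be sketched as a single pass that remembers the most recent `#topology` value and attaches it to each following design. This is a rough Python illustration under the assumption that designs are one per line; `assign_topology` and its return shape are hypothetical and say nothing about how the compiler stores pragmas.

```python
def assign_topology(lines, default="linear"):
    """Pair each design line with the topology in force when it appears."""
    topology = default
    designs = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        if line.startswith("#topology"):
            fields = line.split()
            if len(fields) != 2 or fields[1] not in ("linear", "circular"):
                raise ValueError(f"bad #topology pragma: {line!r}")
            topology = fields[1]  # persists until the next #topology
            continue
        if line.startswith("#"):
            continue  # other pragmas are ignored in this sketch
        designs.append((line, topology))
    return designs


source = [
    "// This would be linear DNA",
    "uFOO4; oERG9 ; dFOO4",
    "#topology circular",
    "gTDH3; oERG6 ; tTDH3",
    "gTDH1; oERG6 ; tTDH3",
    "#topology linear",
    "gTDH2 ; oERG6 ; rERG7",
]
designs = assign_topology(source)
```

Here the four designs come out tagged linear, circular, circular, linear, in that order. An output generator could then read the tag from each assembly rather than re-scanning for the pragma.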

Questions

Is it worthwhile to put the topology directly into the assembly data structure, or can we simply check for the presence of the pragma in each output generator? I believe it would be cleaner to make it an explicit property, but I am open to doing it either way.
This also brings up the question of how the DNA is packaged. It is extremely common to introduce flanking sequences when DNA is actually constructed, e.g. to clone into a vector. Representing the design as a circular piece of DNA without including these extra sequences in at least some of the generated material is incomplete. I will file this as a separate issue, but the two are related.

FEA: gff3 support for reference genomes

The reference gene format used by the GSL compiler is considered a legacy format. It was originally derived from an SGD export, and even that was obscure. We further complicated things by subtracting one from every coordinate, making it zero-based, which is unusual in user-facing biology systems.

One proposal is to replace this legacy format with the GFF3 standard, which is a relatively common format in biology. It suffers from some standardisation issues, but it should allow rich expression of gene structure information with coordinates, be more interoperable with other bioinformatics tools, and optionally allow combining the FASTA sequence and the coordinates into a single file.

To implement these changes we would need to provide:

a GFF3 parser that can replace the current ref format loader
a tool to convert existing ref files into the GFF3 format
optionally, a tool for validating GFF3 files, since they can be non-standard, especially with respect to where the gene identifiers are stored

It might also be desirable to make the format loader a configurable option, to enable future formats. It's questionable whether we should retain support for the existing reference files or just forget about them as a bad memory and encourage migration to GFF3 ;)

I know there is an existing F# implementation of a GFF3 parser, and if that were released it would save some effort. I have code for generating GFF3 files and could quickly write the conversion and validation tool.

Do we wish to combine the DNA sequence and coordinates into a single file? The advantages are that there is just one file floating around with the whole genome, and loading is possibly slightly faster. You can also ensure that the coordinates and reference sequence stay together. The disadvantage is that it's harder to get a copy of just the FASTA file for other analyses, although a conversion tool for that would also be possible.

In terms of interface with the GSL compiler, we could initially create a loader that plugs into the existing Feature data structure, so the majority of the compiler would be untouched by this upgrade. It would be desirable to expand the data structure to capture things like intron/exon coordinates (note we have largely lost these from existing ref files although in theory they could be there). This would enable more intelligent processing of things like open reading frames in the future, but that's a bigger change to the core compiler.
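As a rough illustration of the loader side, the sketch below parses a single GFF3 feature line and shifts the coordinates from GFF3's 1-based inclusive convention to the zero-based convention described above. It is written in Python for brevity, and the output field names (`left`, `right`, `fwd`) are assumptions for the example, not the fields of the compiler's existing Feature structure. The sample line is made up.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines, comments and ## directives.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = unquote(value)  # values are percent-encoded
    return {
        "seqid": seqid,
        "type": ftype,
        # GFF3 is 1-based inclusive; shift both ends to zero-based inclusive.
        "left": int(start) - 1,
        "right": int(end) - 1,
        "fwd": strand == "+",
        "id": attributes.get("ID"),
        "name": attributes.get("Name"),
    }


# Illustrative line only; not taken from a real genome annotation.
record = parse_gff3_line("chr1\texample\tgene\t101\t200\t.\t-\t.\tID=gene0001;Name=FOO1")
```

The gene identifier is read from the `ID` attribute here, which is exactly the part that varies between GFF3 producers, so a validation tool would likely need that lookup to be configurable.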

BUG: combining inline expansion and variable ref to .up or .down causes internal error

Using a variable reference with a qualification for a subpart, e.g. &locus.up, inside a function which also uses inline sequences causes the internal error shown below. The compiler internally generates GSL that it can't process.

E.g. this GSL

#refgenome S288C
#platform stitch
#linkers 0,2,3,9|
let func(locus)=
    &locus.up ; /$HI/ ; &locus.down
end
func(gYNG2)

emits this error with the compiler.

ParserError: near line 1 col 9
syntax error; found ';', expected one of ['>'].
=================================================================
#refgenome S288C
        ^
let func(locus)=
    &locus.up ; /$HI/ ; &locus.down
end

func(gYNG2)

InternalError: near line 3 col 5
An error occurred while parsing this internally-generated GSL source code:
gYNG2.up;/CACATC/ ;gYNG2.down
=================================================================
#refgenome S288C
let func(locus)=
    &locus.up ; /$HI/ ; &locus.down
    ^
end

func(gYNG2)

In contrast, this code

#refgenome S288C
#platform stitch
#linkers 0,2,3,9|
let func(locus)=
    &locus.up ; /$HI/ ; &locus.down
end

func(gYNG2)

compiles without issue. I think it's related to the order of expansion but I don't see why processing &locus.up later should make any difference.

Syntax for Concatenating Strings in GSL

This is a request to implement syntax that enables a user to concatenate two strings in GSL. The proposed functionality would work as follows:

let a = "foo"
let b = "bar"
let ab = a + b

ab should then be a binding to the string "foobar".
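The requested semantics are small enough to state as a sketch. The Python below is not a proposal for the compiler's implementation; `eval_concat` and the `env` dictionary are hypothetical stand-ins for the variable environment, and the sketch assumes `+` is only defined between string bindings.

```python
def eval_concat(expr, env):
    """Evaluate `name + name + ...` where every name is bound to a string."""
    values = []
    for term in expr.split("+"):
        name = term.strip()
        if name not in env:
            raise NameError(f"unbound variable: {name}")
        if not isinstance(env[name], str):
            raise TypeError(f"{name} is not bound to a string")
        values.append(env[name])
    return "".join(values)


env = {"a": "foo", "b": "bar"}
env["ab"] = eval_concat("a + b", env)
```

Rejecting non-string operands up front leaves room to decide separately what `+` should mean for other binding types, if anything.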

@daz10000 @legezam

BUG/FEA: line number tracking for nested functions loses interior coordinates

With the multiple source position tracking feature, the call to functionThree in the source below records the final position on line 4 and the call on line 19, but misses line 15 and line 8. There is a test, TestNestedFunctionExpansionHasThreeSources, that asserts the number of coordinates is 4; it is currently marked ignore. This test should ideally pass (and have its ignore tag removed).

#refgenome cenpk
#platform stitch

let fun1(up,down) = // line 3 (zero numbered)
&up ; &down // line 4
end

let funTwo(gene) = // line 7 
fun1(&gene,&gene) // line 8
end

fun1(uADH1,dADH1) // line 11
funTwo(uADH4) // line 12

let functionThree() = // line 14
 funTwo(uADH7) // line 15
end


         functionThree() // line 19
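The expected behaviour can be modelled by carrying a stack of call-site lines down through each level of expansion, so that interior calls are not dropped. The Python sketch below hard-codes the three functions from the example in a hypothetical `FUNCTIONS` table (zero-based lines, as in the comments) and is only a model of the bookkeeping, not of the compiler's expansion pass.

```python
# name -> list of (line, callee) for each body statement; callee is None
# when the statement is a concrete design rather than a function call.
FUNCTIONS = {
    "fun1": [(4, None)],
    "funTwo": [(8, "fun1")],
    "functionThree": [(15, "funTwo")],
}


def expand(call_line, name, stack=()):
    """Return one position list per emitted design, outermost call first."""
    stack = stack + (call_line,)  # remember this call site
    results = []
    for line, callee in FUNCTIONS[name]:
        if callee is None:
            results.append(list(stack) + [line])  # leaf: a concrete design
        else:
            results.extend(expand(line, callee, stack))
    return results


positions = expand(19, "functionThree")
```

Under this model the call on line 19 yields the four positions 19, 15, 8 and 4, which matches the count of 4 that the ignored test asserts.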

FEA: representation for flanking DNA during construction

Flanking DNA

GSL typically models just the designed DNA (e.g. locus, promoter gene terminator constructs), but during the process of construction it is very common to package the design into a larger piece of DNA e.g. typically a plasmid construct, at which point there are flanking DNA sequences upstream and downstream of the design. We wish to provide some system for describing the richer representation, without completely breaking the separation between design and implementation.

Proposal

Introduce a syntax for specifying a packaging construct that can introduce adjacent DNA sequences.
Enable output generators to create the abstract design and/or the fully packaged design

As an example implementation, the user could create a GSL function that takes a design as input, and returns a more elaborate design with the flanking sequences. We would need a mechanism for flagging to the compiler that this function is needed for packaging design. It's not uncommon to have multiple packaging systems, so we would also need a mechanism at the design level for selecting the packaging function. Finally, output generators (that care) would need to decide whether to generate the abstract form of the DNA, the packaged form of the DNA or optionally both. This might be a compiler level decision, in which case we could somewhat abstract the problem away from the individual generators by making two passes through the output generation.

Example

let myPackager(assembly) =
    /ATGATGCTAGTCGTACGTAGTCAGT/ ; &assembly ; /TGATCGTACGTAGTCGTACGTACGTA/
end


#packager myPackager

uFOO5; oERG10 ; dFOO5
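To make the "abstract versus packaged" distinction concrete, here is a small Python sketch in which an output generator asks for either form of the same design. The `PACKAGERS` table and `render` function are hypothetical; the flanking sequences are the ones from the example above.

```python
# Packager name -> (upstream flank, downstream flank).
PACKAGERS = {
    "myPackager": ("ATGATGCTAGTCGTACGTAGTCAGT", "TGATCGTACGTAGTCGTACGTACGTA"),
}


def render(design_parts, packager=None):
    """Return the abstract part list, or the packaged one if a packager is named."""
    if packager is None:
        return list(design_parts)
    left, right = PACKAGERS[packager]
    # Inline sequences are written /.../ as in GSL source.
    return [f"/{left}/"] + list(design_parts) + [f"/{right}/"]


design = ["uFOO5", "oERG10", "dFOO5"]
abstract = render(design)
packaged = render(design, "myPackager")
```

Keeping the design list untouched and deriving the packaged form on demand is one way to support the two-pass output generation mentioned above without each generator having to know about packaging.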

Semantic versioning of GSL itself

It would be beneficial to have GSL itself be versioned separately from the compiler version.

How can a fool like me start using this software

Dear All,

I would like to try this cool tool. However, I am nearly a fool in programming.

Do you have a fool-proof step-by-step manual?

Best regards,

Bingyin Peng

QUT

FEA: support for explicit tail length in seamless design

Currently seamless designs have somewhat ad hoc methods for controlling primer design. The amplification Tm is controlled by the core targettm pragma, but the tail length is a side effect of the seamlessoverlaptm optimization and is constrained to be shorter than the sum of the lengths of the fwd and rev primers for amplification. This code is controlled in PrimerDesign.fs in the seamless function.

Current logic

The logic is approximately as follows.

Allocate half of the primer length to each of the forward and reverse sides

let maxFwd = dp.pp.maxLength / 2
let maxRev = dp.pp.maxLength - maxFwd

Design fwd and rev amp regions. There are small optimizations in case primer design fails on one side and not the other, reallocating capacity to one side or the other.

 //                       -------> (fwd)
 //   ===================|====================
 //          <----------- (rev)
 //

Now design tails in the success function with the f and r oligos as material. This plays with the tails of the primers to get the overlap Tm correct. This is where the possibly suboptimal planning occurs. Note that the tail length can never exceed the length of the original fwd and rev primers that were designed. This makes it impossible, for example, to combine a normal amplification Tm (e.g. 60) with long (60 bp) tails.

// Get optimal tail lengths ( ~~~s below) to get the right internal seamlessoverlaptm
let bestF,bestR =
    optOverlapTm
        (min (dp.overlapMinLen/2) f.oligo.Length)
        (min (dp.overlapMinLen/2) r.oligo.Length)
        f.oligo.Length
        r.oligo.Length
        9999.0<C>

Proposal

Add an extra #seamlesstaillen pragma to allow control of the tail length. If that pragma is present, the tails are simply set to the explicitly requested length, with the caveat that they cannot exceed the primermaxlen constraint. It would override the seamlessoverlaptm constraint if present. The main complication in implementing this is that currently the fwd and rev primers are the only passed-in material that the success function can use. To exceed the original primer body length, we would need to pass in a longer runway to the success function in addition to the primers. The template / substrate is available at the time of primer design in the designSingle function. One final complication to be aware of is that the design with variable ends (approxEnds) can result in a primer which isn't a simple prefix of the template.
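The arithmetic involved is simple and can be sketched as follows. This Python is a hedged illustration only: `split_budget` mirrors the maxFwd / maxRev allocation quoted earlier, while `explicit_tail_length` is a hypothetical reading of the proposal that assumes the maximum primer length applies to body plus tail of a single primer.

```python
def split_budget(max_length):
    """Split the primer length budget between the fwd and rev sides."""
    max_fwd = max_length // 2
    max_rev = max_length - max_fwd  # rev gets the extra base when odd
    return max_fwd, max_rev


def explicit_tail_length(requested_tail, body_len, primer_max_len):
    """Tail length to use when a tail length is requested explicitly.

    Assumption: body + tail together must fit within primer_max_len.
    """
    if requested_tail < 0 or body_len < 0:
        raise ValueError("lengths must be non-negative")
    room = primer_max_len - body_len
    if room < 0:
        raise ValueError("primer body alone exceeds the maximum primer length")
    return min(requested_tail, room)
```

For instance, with a maximum primer length of 120 and a 39 bp body, a requested 60 bp tail fits as asked, whereas with an 80 bp body it would be clamped to 40 bp.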

Example code

#warnoff zeronine
#platform stitch
#linkers MT,MT|

#primermax 120
#overlapminlen 80
uHO {#fuse} ; dHO

Fwdbody is 39 bp, Fwdtail is 34bp, overlap is 73bp

FEA: parameterless functions

This almost feels like a bug. The following code gives an error. Basically you can't have a function that just expands into code. Of course you can get around this with part definitions but it seems like an omission.

#refgenome cenpk

#platform stitch
let foo() =
    uADH1 ; dADH1
end

foo()

syntax error; found ')', expected one of ['identifier'].
=================================================================
#platform stitch
let foo() =
        ^
    uADH1 ; dADH1
end

foo()
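The error message suggests the grammar requires at least one identifier between the parentheses. As a sketch of the requested behaviour, the Python below parses a function header and treats an empty parameter list as valid. The regular expression and `parse_header` are hypothetical and far simpler than the compiler's real parser (for example, trailing comments are not handled).

```python
import re

# let <name>( <comma-separated params, possibly empty> ) =
HEADER = re.compile(r"^let\s+([A-Za-z_]\w*)\s*\(\s*([^)]*?)\s*\)\s*=\s*$")


def parse_header(line):
    """Return (function name, parameter list) for a `let name(...) =` line."""
    match = HEADER.match(line.strip())
    if match is None:
        raise SyntaxError(f"not a function header: {line!r}")
    name, arg_text = match.groups()
    # An empty argument list is allowed and yields no parameters.
    params = [p.strip() for p in arg_text.split(",")] if arg_text else []
    if any(not p for p in params):
        raise SyntaxError(f"empty parameter name in: {line!r}")
    return name, params
```

With this rule, `parse_header("let foo() =")` gives the name `foo` with an empty parameter list, while a stray comma such as `let foo(a,) =` is still rejected.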
