swiftstudies / oysterkit

OysterKit is a framework that provides native Swift scanning, lexical analysis, and parsing capabilities. In addition, it provides a language called STLR that can be used to rapidly define the rules used by OysterKit.

License: BSD 2-Clause "Simplified" License

Swift 99.95% Shell 0.05%

decoder language lexical-analysis parsing-capabilities swift

oysterkit's Issues

Streams are not lazy enough

Streams are currently using the AST node constructor which is maintaining a hierarchy of nodes. In reality streams should simply pass matched tokens as they are encountered.
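
As a rough illustration of what "lazy" could mean here, the sketch below yields one token per call to `next()` and keeps no node hierarchy. `Token` and `TokenStream` are hypothetical names made up for this example, not OysterKit types, and the tokenising rules (runs of letters, runs of digits, single symbols) are assumptions chosen only to keep the example small.

```swift
// Hypothetical types for illustration; these are not OysterKit's API.
struct Token: Equatable {
    let name: String
    let text: String
}

/// Yields one token per call to `next()`; nothing is scanned ahead of demand
/// and no tree of nodes is retained.
struct TokenStream: Sequence, IteratorProtocol {
    private var remaining: Substring

    init(_ source: String) {
        remaining = Substring(source)
    }

    mutating func next() -> Token? {
        // Skip whitespace between tokens.
        remaining = remaining.drop(while: { $0.isWhitespace })
        guard let first = remaining.first else { return nil }

        // A token is a run of digits, a run of letters, or a single symbol.
        let run: Substring
        let name: String
        if first.isNumber {
            run = remaining.prefix(while: { $0.isNumber })
            name = "number"
        } else if first.isLetter {
            run = remaining.prefix(while: { $0.isLetter })
            name = "word"
        } else {
            run = remaining.prefix(1)
            name = "symbol"
        }
        remaining = remaining.dropFirst(run.count)
        return Token(name: name, text: String(run))
    }
}

var stream = TokenStream("GO NORTH 2")
let firstToken = stream.next()   // only "GO" has been scanned at this point
let rest = Array(stream)         // the remainder is scanned here, on demand
```

The consumer decides how far to scan, so memory use stays proportional to the current token rather than to the whole input.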

Scanner branch optimisation

Provide a terminal tree scanner rule that optimises branched terminal searches. This may need to be done as an optimisation.
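
One way to picture a "terminal tree" is a trie built from the alternative terminals, so that a choice between many strings is resolved in a single walk over the input instead of one comparison per alternative. The sketch below is only that idea in isolation; `TerminalTree` is a hypothetical name and nothing here reflects how OysterKit's scanner is actually structured.

```swift
// Hypothetical sketch; `TerminalTree` is not an OysterKit type.
final class TerminalTree {
    private var children: [Character: TerminalTree] = [:]
    private var isTerminal = false

    init(_ terminals: [String] = []) {
        for terminal in terminals { insert(terminal) }
    }

    private func insert(_ terminal: String) {
        var node = self
        for character in terminal {
            if let next = node.children[character] {
                node = next
            } else {
                let next = TerminalTree()
                node.children[character] = next
                node = next
            }
        }
        node.isTerminal = true
    }

    /// Returns the longest terminal that prefixes `input`, walking the input once.
    func longestMatch(in input: String) -> String? {
        var node = self
        var matchedEnd: String.Index? = nil
        var index = input.startIndex
        while index < input.endIndex, let next = node.children[input[index]] {
            node = next
            index = input.index(after: index)
            if node.isTerminal { matchedEnd = index }
        }
        return matchedEnd.map { String(input[..<$0]) }
    }
}

let verbs = TerminalTree(["INVENTORY", "GO", "PICKUP", "DROP", "ATTACK"])
let verb = verbs.longestMatch(in: "PICKUP SWORD")
```

The cost of a lookup depends on the length of the matched terminal, not on the number of alternatives in the choice.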

Complete all private documentation

Complete all private or fileprivate access scope documentation

Bug? Pinned tokens could be adopted by their parents

When a successful match results in no node, or in a transient node, for a HomogenousAST, the children are hoisted to the parent. Is this behaviour correct?

Investigation should validate:

Whether it should be at least a warning when a pinned node is also transient, or when the constructor creates no node for it (perhaps a side effect of the complexity of this approach?)
Whether the structure should be preserved when there are pinned children (I don't think so, unless the parent is also pinned)

Take a fresh look at RuleInstance

Seems like it could be used to dramatically simplify the implementation of ParsingRules...

You must declare the name of the grammar before any other declarations (e.g. grammar <your-grammar-name>) from 47 to 47

I'm trying to build an stlr file for swift comments (taken from "The Swift Programming Language" book).

This is my grammar file: swift.stlr

grammar SwiftComments

whitespace = whitespace-item whitespace?

whitespaceItem = lineBreak | comment | multiline-comment |
				  "\u0000" | "\u0009" | "\u000B" | "\u000C" | "\u0020"

lineBreak = "\u000A" | "\u000D" | "\u000D\u000A"

comment = "//" commentText lineBreak
multilineComment = "/*" multilineCommentText "*/"

commentText = commentTextItem commentText?
commentTextItem = /[^\r\n]/

multilineCommentText = multitlineCommentTextItem multitlineCommentText?
multilineCommentTextItem = (>> !"/*" | !"*/") (multilineComment | commentTextItem)

When I run stlrc -g swift.stlr I get the error in issue title. Any pointers to where I got this wrong?

Cache compiled regular expressions

Where a regular expression is used to represent a terminal, at this stage those regular expressions are recompiled each time the rule is referenced in the generated Swift code.

Those regular expressions should be identified and compiled lazily, exactly once, rather than being recompiled on each reference.
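
A minimal sketch of the compile-once idea, assuming the terminals are backed by Foundation's `NSRegularExpression`. `RegexCache` and its `compileCount` counter are hypothetical and exist only to make the behaviour observable; the generated Swift code might equally use a `static let` per pattern.

```swift
import Foundation

// Hypothetical cache for illustration; not part of OysterKit's generated code.
final class RegexCache {
    private var compiled: [String: NSRegularExpression] = [:]
    private let lock = NSLock()
    private(set) var compileCount = 0

    /// Compiles `pattern` on first use and returns the cached instance afterwards.
    func regex(for pattern: String) throws -> NSRegularExpression {
        lock.lock()
        defer { lock.unlock() }
        if let cached = compiled[pattern] { return cached }
        let regex = try NSRegularExpression(pattern: pattern, options: [])
        compiled[pattern] = regex
        compileCount += 1
        return regex
    }
}

let cache = RegexCache()
do {
    let first = try cache.regex(for: "[^\\r\\n]")
    let second = try cache.regex(for: "[^\\r\\n]")
    // Both references resolve to the same compiled object.
    print(first === second, cache.compileCount)
} catch {
    print("Invalid pattern: \(error)")
}
```

The lock keeps the dictionary safe if rules are evaluated from more than one thread; a single-threaded parser could drop it.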

Support for Carthage

It'd be great to have this framework support Carthage.

Using OysterKit for syntax highlight

I'm pretty new to language creation using tools like STLR. I was wondering whether OysterKit and STLR would be of any help in creating a simple editor with rudimentary syntax highlighting (by recognizing language node types and providing location information in the source string).

Evaluate documentation generation tools

List as it stands

Jazzy
Other ideas?

API Clean Up: Rationalize protocol requirements for HomogenousAST

HomogenousAST has two properties, children and tokens, which are of the same type. They are driven by two different protocols, which is fine, but their terminology needs to be consolidated so that it becomes less ambiguous what these properties are.

Complete all public documentation

Will break this out if people want to help!

Streams do not respect @void, @pin, or @transient annotations

They should

Complete all internal documentation

Add documentation for all internally scoped symbols in OysterKit

Local absolute path in project file

I don't think absolute paths are a good idea. Found this in the project file, trying to guess how to use OysterKit.

/Users/nhughes/Documents/Code/XCode/GitHub/OysterKit/Mac/OysterKit/../../Common/Framework;

Bork tutorial example always fails with obscure error

Hi, thanks for the great library!

When I go through the tutorial, I copy the example grammar exactly as given:

//
// A grammar for the Bork text-adventure
//


// Vocabulary
//
@pin verb        = "INVENTORY" | "GO" | "PICKUP" | "DROP" | "ATTACK"
@pin noun        = "NORTH" | "SOUTH" | "KITTEN" | "SNAKE" | "CLUB" | "SWORD"
@pin adjective   = "FLUFFY" | "ANGRY" | "DEAD"
@pin preposition = "WITH" | "USING"

// Commands
//
subject     = (adjective .whitespaces)? noun
command     = verb (.whitespaces subject (.whitespaces preposition .whitespaces subject)? )?

to a file Bork.stlr

After I test this grammar with swift run stlrc -g Bork.stlr I always get this error with any of the tutorial test strings:

Parsing failed: 
constructionFailed([])

The error doesn't give any info on what exactly failed. OysterKit code is from master branch.

Offer a brew install option for command line tool

Following how Swiftenv (https://swiftenv.fuller.li/en/latest/installation.html) does the brew install setup, I would love to be able to install the command line interface via brew instead of having to worry about cloning it or running it from some variant of a path.

Rework error handling

The new stack and rule system enables far simpler error handling, but the code is still littered with old special cases that are no longer required.

Before release of v1 this should be cleaned up to use simple hierarchical errors and all error handling removed from IRs and parsing strategies.

Incorrect error raised when STLR parsing fails after at least one rule

This will require that the .endOfFile character class be implemented, to ensure that parsing has continued to the end of the file.

Update STLR tutorial

Probably best left until 0.7 but this can be further improved.

Won't compile in Swift 1.0

needs an upgrade

Add quick help for all public types for scanning and lexing

In order to test documentation generation one section of OysterKit should be fully documented for Quick Help and other entities supported by Jazzy.

Complete public documentation

Refactoring has left a few functions undocumented. Their documentation needs to be finished and added.

Improve parsing error output from stlrc

Currently dumps a giant lump of causes. Format this more cleanly and trim as much as possible.

OysterKit doesn't parse a mix of one- and two-letter identifiers

Let's say we want to parse different identifiers composed of one or two letters.

grammar Example

identifier = "A" | "B" | "Aa" | "Ab" | "Ba" | "Bb"

Then this test

try Example.build("Ab")

throws an error: Interpretation Error: AST construction failed.
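
One possible explanation, offered only as an assumption: if the choice is ordered (first alternative that matches wins, as in PEG-style parsers), then "A" matches the first character of "Ab" and leaves "b" unconsumed. The report does not confirm that this is what OysterKit does. The sketch below shows the effect with a hypothetical `orderedChoice` helper and how listing longer alternatives first changes the result.

```swift
/// Ordered choice (hypothetical helper): returns the first alternative,
/// in declaration order, that prefixes `input`.
func orderedChoice(_ alternatives: [String], in input: String) -> String? {
    alternatives.first { input.hasPrefix($0) }
}

let declared = ["A", "B", "Aa", "Ab", "Ba", "Bb"]
let longestFirst = declared.sorted { $0.count > $1.count }

// With the declared order, "A" wins and "b" is left over.
let shortMatch = orderedChoice(declared, in: "Ab")

// With longer alternatives first, the whole identifier is consumed.
let fullMatch = orderedChoice(longestFirst, in: "Ab")
```

If this is indeed the cause, reordering the grammar so that the two-letter alternatives precede the one-letter ones would be a workaround worth trying.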

Increase unit test coverage

As above, general coverage needs to get up into the 80s before 1.0.

Improve syntactic sugar for @transient and @void

At the moment the "-" suffix signifies consume (poorly defined). The following modifiers should be defined:

"-" the generated token should be void
"~" the generated token should be transient

pinned terms as regex

Make pinned terms able to be regex like traditional lexers

Remove internal dependencies on OysterKit deprecated API

There are a few in OysterKit, many in STLR, and over 50 within the tests themselves.

This API will not be removed immediately but is guaranteed to be gone by v1.0

Wrong data structure created for array of choices

First example

It generates code that doesn't compile:

grammar Example

offOrOn = ("0" | "1")+

generates:

public struct Example: Codable {
    public let offOrOn : OffOrOn
    ...
}

Second example

It generates the structure that won't be able to represent the parsed string:

grammar Example

off = "0"
on = "1"

mix = (off | on)+

result:

public struct Example: Codable {
    
    // Mix
    public enum Mix: Codable {
        case on(on:[Swift.String])
        case off(off:[Swift.String])
    }
    ...
}
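
For comparison, a hand-written shape that could represent a string such as "0110" is an array of enum cases rather than an enum whose cases each hold an array. This is only a sketch of the expected data structure; it is not what STLR generates, and `parseMix` is a hypothetical stand-in for the generated parser.

```swift
// Hand-written sketch of a structure that can hold a mixed sequence.
struct Example: Codable, Equatable {
    enum Mix: String, Codable, Equatable {
        case off = "0"
        case on = "1"
    }

    // One entry per matched element, preserving order.
    let mix: [Mix]
}

/// Hypothetical stand-in for the parser: maps each character to a case
/// and fails on an empty input, mirroring the `+` quantifier.
func parseMix(_ source: String) -> Example? {
    var elements: [Example.Mix] = []
    for character in source {
        guard let element = Example.Mix(rawValue: String(character)) else { return nil }
        elements.append(element)
    }
    return elements.isEmpty ? nil : Example(mix: elements)
}

let parsed = parseMix("0110")
```

With `[Mix]` the interleaving of off and on is preserved, which the generated `case on(on:[Swift.String])` / `case off(off:[Swift.String])` shape cannot express.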

Revisit extensions for building rules

Consider overloading | or +, then ! for not, and generally review what is present.

STLR - Sub transient evaluation optimiser

When a transient token is encountered all children are disposed of. An optimiser could be created to ensure that those children are not created in the first place so that no performance penalty is incurred.

An optimiser could mark all children as void or transient (investigate as this will impact the preservation of ranges).

Optimizer makes void tokens transient

Currently, when a void rule can be reduced to a single terminal rule, the optimiser makes it produce a transient token with no void annotation.

Support parsing a plain byte stream

Based on the examples and the definition of the parse() function, it looks like it's only possible to parse strings, not the more generic Data. I think it would be a nice feature to be able to run the parser over generic Data, as this would allow parsing more complex grammars where some rules may define a blob array or something like that.

Bork example errors with preposition

I pulled the Bork repo, as well as my own, and when I run the example "ATTACK SNAKE WITH SWORD" I'm getting the following error:

keyNotFound(CodingKeys(stringValue: "noun", intValue: nil), Swift.DecodingError.Context(codingPath: [CodingKeys(stringValue: "secondSubject", intValue: nil)], debugDescription: "No value associated with key CodingKeys(stringValue: \"noun\", intValue: nil) (\"noun\").", underlyingError: nil))

// Lexer

@pin verb         = "INVENTORY" | "GO" | "PICKUP" | "DROP" | "ATTACK"
@pin noun         = "NORTH" | "SOUTH" | "KITTEN" | "SNAKE" | "CLUB" | "SWORD"
@pin adjective    = "FLUFFY" | "ANGRY" | "DEAD"
@pin preposition  = "WITH" | "USING"

// Commands
subject = (adjective .whitespace)? noun
command = verb (.whitespace subject (.whitespace preposition .whitespace @token("secondSubject") subject)? )?

Just like your tutorial.

Thanks!

Finalize changes to STLR language

Remove some of the old transient tokens that are no longer needed, and make the new ones that can be used for error propagation and handling standard instead of custom.

Refactor token streaming

At the moment there is little difference in implementation (and therefore potentially expected memory/performance profiles as well as behavioural characteristics) for streams, homogenous and heterogenous ASTs.

These areas could be refactored to improve:

Ease of consumption
Delivery of the expected benefits (lazy and therefore low memory consumption)

In order to do this the following should be provided:

Streaming should identify "trigger" tokens that will be forwarded, but no others. Consideration will have to be given to range management on the node stack (or a new equivalent) but nodes should not be created
There is an attempt to leverage commonality in behaviour between homogenous and heterogenous AST generation. This should be preserved but without the complexity of having to provide an IR and a constructor and still being left in a situation where a lot of casting has to be done

Broken link in the documentation - Simple Basic

A link in the Simple Basic example opens "No page found":

Nested expression matching

One awesome thing would be ability to nest pattern matches/calls within your grammar, like a PEG, Antlr, or even a Flex/Bison:

Example:

statement = [a-zA-Z0-9]+
statements = statement + statements

It might do this, just didn't see anything in the docs about it.

ps. awesome work on regex support!

Bork tutorial broken

Hey there,

I am not sure if this is still maintained, but I tried to follow the steps in the Bork tutorial and it seems to be broken.

When defining the grammar stlrc breaks because the name of the grammar is not defined (missing in tutorial)
When running stlrc with the defined Bork grammar, no output is provided (macOS 10.15.4, Swift 5.2.2)
After editing the main.swift file to include the user input, compiling fails as the generated Bork.swift module does not include the parse method:

I hope this is still maintained! Thank you :)

OysterKit.swift missing

I converted the code to use Swift 1.2, and also fixed the absolute path in the project to get things compiling.

One thing I'm not sure I can work around though, the iOS version of the OysterKit framework includes the file "OysterKit.swift", which is not in the project anywhere. I note that the Mac version of the framework has OKStandard.swift - is that a new name for the same file?

Lost identifier annotations in dynamic language generation

In the dynamic language extension of the STLRIntermediateRepresentation.GrammarRule

fileprivate func rule(from grammar:STLRIntermediateRepresentation, 
                                  inContext context:GenerationContext, 
                                  creating token:Token? = nil, 
                                  annotations: RuleAnnotations)->Rule?

It's a bit of a mess really. There are two code paths, one for LHR recursive rules and one for rules that aren't. In both cases, if the expression for the grammar rule is just a single element that does generate a token, the annotations are stripped from the new identifier (they either propagate into the expression or are lost).

I think the right solution is to make an identifier declaration rule different to others in that it is a rule with the token (the identifier) and the annotations on that declared identifier, but the MATCHER is taken from the expression's rule. You would have to be careful you don't end up with double tokens. If the token is transient then there isn't a problem.

GENERATED EXPRESSION'S RULE HAS A NON-TRANSIENT TOKEN OR ITS OWN ANNOTATIONS

WrappingRule(identifier, identifier annotations).matcher =
identifier(annotations).matcher = sequence(creating expression's token, annotated with the expression's annotations) of a single rule using the expression's matcher but stripped of its annotations and transient

GENERATED EXPRESSION'S RULE HAS A NON-TRANSIENT TOKEN

WrappingRule(identifier, identifier annotations).matcher = Expression's matcher

Update stlrc to use TerminalKit

CommandKit is heavy and fundamentally hard to understand. TerminalKit is simpler (and somewhat less powerful) but is more than enough and will enable easier extension.

https://github.com/SwiftStudies/TerminalKit

Add .backslash as predefined character in STLR rules

With the escaping in STLR, specifying a backslash "\\" can detract from the readability. Would be handy to just use .backslash

Update BORK tutorial for 0.6+

Can be dramatically simplified

STLR: using singular term for predefined characters

Use singular for predefined sets (.decimalDigit, .letter vs. .decimalDigits, .letters)

ICU specifies character categories in singular terms (\p{Decimal}, \p{Letter}) with the quantifier being separate (+, *, ?, {n,m})

STLR rules also use modifiers (?,*,+) to specify quantity, which seems more in line with a regular-expression style of declaration. A plural term in the definition seems to imply multiple characters of the given category, when it is only one of N. When you read the STLR

number = .decimalDigits

one could infer that it matches the number "123" when in fact it will only match the "1".

Examples of the singular w/ modifiers:

digits = .decimalDigit+ // one or more of a decimal digit
ows = .whitespace? // optional whitespace char
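
The distinction being argued for (a character class matches exactly one character, and the quantifier decides how many) can be sketched outside STLR as follows. `matchOne` and `matchOneOrMore` are hypothetical helpers written for this example, and restricting "decimal digit" to ASCII 0-9 is an assumption made for brevity.

```swift
// Assumption for this sketch: "decimal digit" means ASCII 0 through 9.
let decimalDigits: ClosedRange<Character> = "0"..."9"
let isDecimalDigit: (Character) -> Bool = { decimalDigits.contains($0) }

/// Matches exactly one character satisfying `predicate` (no quantifier).
func matchOne(_ input: Substring, where predicate: (Character) -> Bool) -> Substring? {
    guard let first = input.first, predicate(first) else { return nil }
    return input.prefix(1)
}

/// Matches one or more characters satisfying `predicate` (the `+` quantifier).
func matchOneOrMore(_ input: Substring, where predicate: (Character) -> Bool) -> Substring? {
    let run = input.prefix(while: predicate)
    return run.isEmpty ? nil : run
}

let single = matchOne("123", where: isDecimalDigit)            // one digit only
let digits = matchOneOrMore("123abc", where: isDecimalDigit)   // the whole run
```

Read this way, a singular name such as .decimalDigit describes what `matchOne` does, and the plural meaning only appears once a quantifier is attached.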

Suggestion to rename .whitespaces to .whitespace or .whitespaceOrTab or to change their behaviour

When reviewing #47 I realised that the .whitespaces naming doesn't play well with .whitespaces*, .whitespaces+ or .whitespaces?. It implies that many whitespace characters may be matched, while only one is matched.

It could be better for library user experience to either rename it or make it match multiple symbols.
Possible naming: .whitespace, .whitespaceChar, .whitespaceCharacter, .whitespaceOrTab.

Same actually could be applied to .newlines, which I think matches only one newline character.

Overall, looking at character sets, singular/plural naming is inconsistent at first glance. I understand that it refers to the number of characters in a character set, but users of STLR might not understand it this way (as happened to me). I think that communicating how many characters will be matched is more important than how many characters are in a set.