
Parsus


A framework for writing composable parsers for JVM, JS and Kotlin/Native based on Kotlin Coroutines.

val booleanGrammar = object : Grammar<Expr>() {
    init { regexToken("\\s+", ignored = true) }
    val id by regexToken("\\w+")
    val lpar by literalToken("(")
    val rpar by literalToken(")")
    val not by literalToken("!")
    val and by literalToken("&")
    val or by literalToken("|")
    val impl by literalToken("->")

    val variable by id map { Var(it.text) }
    val negation by -not * ref(::term) map { Not(it) }
    val braced by -lpar * ref(::root) * -rpar

    val term: Parser<Expr> by variable or negation or braced

    val andChain by leftAssociative(term, and, ::And)
    val orChain by leftAssociative(andChain, or, ::Or)
    val implChain by rightAssociative(orChain, impl, ::Impl)

    override val root by implChain
}

val ast = booleanGrammar.parse("a & (b1 -> c1) | a1 & !b | !(a1 -> a2) -> a").getOrThrow()

Usage

Using with Gradle for JVM projects
dependencies {
    implementation("me.alllex.parsus:parsus-jvm:0.6.1")
}
Using with Gradle for Multiplatform projects
kotlin {
    sourceSets {
        val commonMain by getting {
            dependencies {
                implementation("me.alllex.parsus:parsus:0.6.1")
            }
        }
    }
}
Using with Maven for JVM projects
<dependency>
  <groupId>me.alllex.parsus</groupId>
  <artifactId>parsus-jvm</artifactId>
  <version>0.6.1</version>
</dependency>

Features

  • Zero dependencies. Parsus depends only on the Kotlin Standard Library.
  • Pure Kotlin. Parsers are specified by users directly in Kotlin without the need for any codegen.
  • Debuggable. Since parsers are pure non-generated Kotlin, they can be debugged like any other program.
  • Stack-Neutral. Leveraging the power of coroutines, parsers are able to process inputs with arbitrary nesting entirely avoiding stack-overflow problems.
  • Extensible. Parser combinators provided out-of-the-box are built on top of only a few core primitives. Therefore, users can extend the library with custom powerful combinators suitable for their use-case.
  • Composable. Parsers are essentially functions, so they can be composed in imperative or declarative fashion allowing for unlimited flexibility.

There are, however, no pros without cons. Parsus relies heavily on the coroutines machinery. This comes at the cost of some performance and memory overhead compared to other techniques, such as generating parsers at compile-time from special grammar formats.

Quick Reference

This is a reference of some of the basic combinators provided by the library.

Every combinator is available in both procedural-style and combinator-style grammars. You can pick and choose the style for each parser and sub-parser, as there are no restrictions.

Parsing a token and getting its text (parses: ab, aB)

Procedural:

val ab by regexToken("a[bB]")
override val root by parser {
    val abMatch = ab()
    abMatch.text
}

Combinator:

val ab by regexToken("a[bB]")
override val root by ab map { it.text }

Parsing two tokens sequentially (parses: ab, aB)

Procedural:

val a by literalToken("a")
val b by regexToken("[bB]")
override val root by parser {
    val aMatch = a()
    val bMatch = b()
    aMatch.text to bMatch.text
}

Combinator:

val a by literalToken("a")
val b by regexToken("[bB]")
override val root by a and b map
    { (aM, bM) -> aM.text to bM.text }

Parsing one of two tokens (parses: a, b, B)

Procedural:

val a by literalToken("a")
val b by regexToken("[bB]")
override val root by parser {
    val abMatch = choose(a, b)
    abMatch.text
}

Combinator:

val a by literalToken("a")
val b by regexToken("[bB]")
override val root by a or b map { it.text }

Parsing an optional token (parses: ab, aB, b, B)

Procedural:

val a by literalToken("a")
val b by regexToken("[bB]")
override val root by parser {
    val aMatch = poll(a)
    val bMatch = b()
    aMatch?.text to bMatch.text
}

Combinator:

val a by literalToken("a")
val b by regexToken("[bB]")
override val root by maybe(a) and b map
    { (aM, bM) -> aM?.text to bM.text }

Parsing a token and ignoring its value (parses: ab, aB)

Procedural:

val a by literalToken("a")
val b by regexToken("[bB]")
override val root by parser {
    skip(a) // or just a() without using the value
    val bMatch = b()
    bMatch.text
}

Combinator:

val a by literalToken("a")
val b by regexToken("[bB]")
override val root by -a * b map { it.text }

Introduction

The goal of a grammar is to define rules by which to turn an input string of characters into a structured value. This value is usually an abstract syntax tree. But it could also be an evaluated result, if we have specified evaluation rules directly in the grammar.
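As an illustration of evaluating directly in the grammar, here is a sketch (using the Parsus API shown later in this document, including the nullable poll function) that sums integers separated by plus signs without building an intermediate tree:

```kotlin
// Sketch only: evaluates an input like "1+2+3" directly to an Int, with no AST
val sum = object : Grammar<Int>() {
    val num by regexToken("[0-9]+")
    val plus by literalToken("+")
    override val root: Parser<Int> by parser {
        var total = num().text.toInt()
        while (poll(plus) != null) { // poll returns null instead of failing
            total += num().text.toInt()
        }
        total
    }
}
```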

In order to define a grammar we only need two things: list of tokens and a root parser. Here is how one of the simplest grammars looks with Parsus:

val g1 = object : Grammar<String>() {
    val tokenA by literalToken("a")
    override val root by parser { tokenA().text }
}

println(g1.parseOrThrow("a")) // prints "a"

It is just a few lines of declarative code, but there is a lot going on under the hood. So, let us break it down.

Grammars

First, there is the Grammar class, which needs to be extended in order to define your custom grammar. In the example above, an anonymous class is declared, but it could just as well be a named class.

class MyClass : Grammar<MyResult>() {
    // tokens and parsers go here

    override val root: Parser<MyResult> = TODO()
}

There are two important things to note. Grammar is a generic class: its type parameter defines the result type of the root parser. Because Kotlin requires us to specify the type parameter of the class, the explicit type of the root parser can often be omitted. The root parser is used to produce the parsed result when calling a method such as parseToEnd on a grammar. However, before we can discuss how to define the root and other parsers, we need to understand the basic building block of any parser - a token.

Tokens

Each token we declare within a grammar describes a pattern by which this token can be recognized in the input string. Whenever a parser requires the next token to proceed, it asks the grammar to find a token match at the current position in the input. When a match is found, it is described by the token, the offset in the input string where the match starts, and the length of the match.

The simplest type of token is a literal token. It matches only the exact string given as the literal. Therefore, the token tokenA from the example will only match if the character at the current position is "a".

    val tokenA by literalToken("a")

Another thing to note is that the member tokenA is declared via the by keyword, meaning that it uses Kotlin's property-delegation mechanism. When declared this way, tokens are automatically registered within the grammar, so they can participate in the matching process when parsing.

Alternatively, a token can be registered anonymously. This is useful when we do not need to reference the token anywhere else while writing parsers. Most often, tokens that should be ignored are defined this way.

val g2 = object : Grammar<String>() {
    init {
        regexToken("\\s+", ignored = true)
    }

    val tokenA by literalToken("a")
    override val root by parser { tokenA().text }
}

println(g2.parseOrThrow(" a\t")) // prints "a"

In this example, we create a token by calling regexToken. This token will use the regular expression to match any whitespace in the input string. Since we want to simply ignore the whitespace, we will not reference this token in any of the parsers. Therefore, we register the token in the init-block of the class without assigning it to a member.

Now that we know how to declare and register different kinds of tokens, let us explore how to use them to write parsers.

Parsers

A parser definition achieves two goals. Firstly, it defines the sequence of tokens expected to appear in the input. Secondly, it transforms the matched tokens into a value.

One of the simplest parsers that we can construct expects only one token and returns the text of the token match as a value. And that is exactly what we saw previously.

val g1 = object : Grammar<String>() {
    val tokenA by literalToken("a")
    override val root by parser { tokenA().text }
}

In order to understand how to use parsers, we need to take a look at the core abstractions.

The central piece of the puzzle is the Parser interface itself.

interface Parser<out T> {
    suspend fun ParsingScope.parse(): T
}

Essentially, a parser is a function that can be called within a parsing scope and returns a parsed value. And anything that is a function can almost certainly be represented as a lambda. This is exactly how we have seen parsers defined: the parser { ... } function takes a lambda and returns a parser.

The parsing result is an explicit representation of either a successfully parsed value, or an error that the parser encountered while trying to process the input.

sealed class ParseResult<out T>
data class ParsedValue<T>(val value: T) : ParseResult<T>()
abstract class ParseError : ParseResult<Nothing>()
data class MismatchedToken(val expected: Token, val found: TokenMatch) : ParseError()
// more parser errors
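Because the result is an ordinary sealed class, it can be matched exhaustively with a when expression instead of calling getOrThrow. A sketch, reusing the g1 grammar from above:

```kotlin
// Sketch: handling both branches of a ParseResult explicitly
when (val result = g1.parse("a")) {
    is ParsedValue -> println("parsed: ${result.value}")
    is ParseError -> println("failed: $result")
}
```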

The most powerful thing about parsers is that they can be composed, and the parsing scope is what gives them this power. The ParsingScope interface provides an extension function to execute any parser and extract its result.

interface ParsingScope {
    suspend operator fun <R> Parser<R>.invoke(): R
    // ... more ...
}

We have already seen an example of a call to this function: tokens are parsers too. The Token class implements Parser<TokenMatch>, so when invoked within a parsing scope it returns an actual TokenMatch. From this match we can take the text fragment of the input string to which the match corresponds. The text fragment can then be converted into a number, stored as the name of an identifier, etc.

Here is a grammar that parses an integer:

val g3 = object : Grammar<Int>() {
    val tokenNum by regexToken("[0-9]+")
    override val root by parser { tokenNum().text.toInt() }
}

println(g3.parseOrThrow("123")) // prints 123

Parser Combinators

In order to combine parsers, we need to define more than one. Intermediate parsers can be declared as members of the same grammar class to make them easier to reuse.

As we have learned previously, tokens are parsers. So we can define a couple of them to play with.

val g4 = object : Grammar<String>() {
    val tokenNum by regexToken("[0-9]+")
    val tokenId by regexToken("[a-z]+")
    val tokenPlus by literalToken("+")
    override val root by parser {
        val id = tokenId().text
        tokenPlus()
        val num = tokenNum().text
        "($id) + ($num)"
    }
}

println(g4.parseOrThrow("abc+123")) // prints "(abc) + (123)"

This example shows the main way in which parsers are combined: sequentially. The root parser expects an id to appear first, then a plus sign, then a number. If at any point there is an unexpected token, the whole parser fails with a mismatched-token error.

Notice also that we use another useful property of sequential execution. With the tokenPlus() statement we execute the parser but ignore the result. This is most often done with token parsers when we only need to make sure that a certain piece of syntax appears in the expected place in the input.
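The same intent can be stated more explicitly with the skip function from the Quick Reference above. This sketch of an alternative g4 root parser should behave identically:

```kotlin
override val root by parser {
    val id = tokenId().text
    skip(tokenPlus) // same as tokenPlus(), but makes the discarded result explicit
    val num = tokenNum().text
    "($id) + ($num)"
}
```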

Another important way of combining parsers is to say that we expect one of several parsers to succeed at a certain point. Even when the first parser fails, the parent parser does not produce an error immediately. Instead, it tries the remaining alternatives. If one alternative succeeds, the parent parser takes its result and proceeds without any errors.

We can use the choose function from the ParsingScope to achieve this behaviour:

val g5 = object : Grammar<String>() {
    val tokenNum by regexToken("[0-9]+")
    val tokenId by regexToken("[a-z]+")
    val tokenPlus by literalToken("+")
    override val root by parser {
        val idOrNum1 = choose(tokenNum, tokenId).text
        tokenPlus()
        val idOrNum2 = choose(tokenNum, tokenId).text
        "($idOrNum1) + ($idOrNum2)"
    }
}

println(g5.parseOrThrow("abc+123")) // prints "(abc) + (123)"
println(g5.parseOrThrow("909+wow")) // prints "(909) + (wow)"

Now we have a repeating piece of code inside our parser implementation, so we ought to refactor it by introducing an intermediate parser, term, to do the job. Since term is a parser, it can be invoked within the parsing scope.

val g6 = object : Grammar<String>() {
    val tokenNum by regexToken("[0-9]+")
    val tokenId by regexToken("[a-z]+")
    val tokenPlus by literalToken("+")
    val term by parser { choose(tokenNum, tokenId).text }
    override val root by parser {
        val idOrNum1 = term()
        tokenPlus()
        val idOrNum2 = term()
        "($idOrNum1) + ($idOrNum2)"
    }
}

Armed with this knowledge of the basics, you can now explore more sophisticated parser implementations that use various extension functions to make parser definitions look declarative. Also, feel free to get familiar with the core interfaces and their extension functions to learn how more elaborate parser combinators can be created from the provided primitives.
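For comparison, here is a sketch of the same g6 grammar rewritten declaratively with the combinators from the Quick Reference (or, map, and the * operator with - for ignored parts):

```kotlin
val g6c = object : Grammar<String>() {
    val tokenNum by regexToken("[0-9]+")
    val tokenId by regexToken("[a-z]+")
    val tokenPlus by literalToken("+")
    // `or` tries the alternatives; `map` extracts the matched text
    val term by tokenNum or tokenId map { it.text }
    // `-tokenPlus` matches the plus sign but drops it from the result tuple
    override val root by term * -tokenPlus * term map
        { (idOrNum1, idOrNum2) -> "($idOrNum1) + ($idOrNum2)" }
}
```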

Examples

Here are some examples of grammars written with Parsus:

Coroutines

Most often, coroutines in Kotlin are explored and used in the context of concurrency. This is not surprising, because they allow turning callback-ridden asynchronous code into sequential implementations that are less error-prone and easier to read.

In Kotlin, structured concurrency and other machinery for multi-threaded environments are provided by the kotlinx.coroutines library (note the x after kotlin). This library, like any other, builds on lower-level capabilities of the language itself. More specifically, the one and only language mechanism enabling coroutines in Kotlin is suspension.

Kotlin's suspend keyword allows declaring so-called suspending functions. Most of the time, adding this keyword is seen as a necessary down payment before entering the world of structured concurrency. Not always, though. Even in the Kotlin standard library there is at least one example of suspending functions being used without any multi-threaded context: sequence builders.

You can build an infinite sequence of Fizz-Buzz numbers like this:

fun main() {
    val fb = sequence {
        var i = 1
        while (true) {
            if (i % 3 == 0 || i % 5 == 0) yield(i)
            i++
        }
    }

    for (x in fb.take(10)) {
        println(x)
    }
}

As you may have guessed, the lambda we pass to the sequence builder is a suspending function. From inside this lambda we can use the yield function, which is also suspending.

On careful inspection, we can conclude that the suspending functions behind sequence builders have nothing to do with the dispatchers, flows, and channels of kotlinx.coroutines. They simply highlight the power of Kotlin's built-in capabilities. More applications of "bare" coroutines can be found elsewhere; for example, coroutines can aid in a rather idiomatic implementation of monads directly in Kotlin.

Finally, this project itself takes on the mission of leveraging coroutines to construct and execute parsers. Continuations, as first-class citizens, can be stored in memory, entirely avoiding unexpected stack overflows for heavily nested parsing rules and deeply structured input. Suspending functions make sequential composition of parsers trivial. The error-handling mechanisms that come with coroutines allow for declarative definition of branching in parsers. Everything else is a fully extensible and debuggable collection of combinators on top of just a couple of core primitives.
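To illustrate the stack-neutrality claim, here is a hedged sketch of a grammar that counts arbitrarily deep parenthesis nesting. It reuses only API shown earlier in this document (parser, poll, token invocation); a conventional recursive-descent parser would risk a stack overflow on very deep inputs, whereas here the recursion is carried by suspension:

```kotlin
// Sketch only: counts the nesting depth of balanced parentheses
object Parens : Grammar<Int>() {
    val lpar by literalToken("(")
    val rpar by literalToken(")")
    override val root: Parser<Int> by parser {
        if (poll(lpar) != null) {
            val depth = root() // recursive reference; suspension keeps the call stack flat
            rpar()
            depth + 1
        } else 0
    }
}
```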

Acknowledgements

The structure of the project as well as the form of the grammar DSL is heavily inspired by the better-parse library.

License

Distributed under the MIT License. See LICENSE for more information.

parsus's People

Contributors

alllex, asemy, sebastianaigner


parsus's Issues

Request: convert to Kotlin Multiplatform

Hi,

I'd like to use Parsus in a Kotlin/Native application (Windows/Linux/macOS). Since Parsus has zero dependencies, I think this could be achieved with a minor code refactor and an update to the Gradle build config.

I've had a look at the codebase and from what I can see the only JVM specific dependencies are:

I'm willing to make a PR, so if you're interested I can submit one and see what it looks like in practice.

MismatchedToken error: `Token(EOF)` is expected, despite requesting the found token of `LiteralToken(' ')`

I would like to parse the following (which is a simplified example of the full text I would like to parse).

Version: 1.2.3
  Features:
  Fixes:

Following Version is a list of category strings. Each category must be prefixed with two spaces and suffixed with a colon (:).

I'd like to parse it into this class:

data class Demo(
  val version: String,
  val categories: List<String>
)

I have written a parser (see full code below) that takes the leading whitespace into account.

  /** leading category-name whitespace, to be ignored */
  private val categoryNameIndent by literalToken("  ")
  private val categoryNameSuffix by literalToken(":")
  private val categoryName by -categoryNameIndent * text * -categoryNameSuffix

However, I get an error

MismatchedToken(expected=Token(EOF), found=TokenMatch(token=LiteralToken('  '), offset=15, length=2))

This error is very confusing because it seems to have swapped the expected and found tokens: I didn't expect EOF, while I did expect LiteralToken('  '). And even then, why did the parser not find the literal token? It's hard to figure out, even when debugging, so help would be appreciated.

fun main() {
  val demo = DemoGrammar.parseEntire(
    /* language=text */ """
Version: 1.2.3
  Features:
  Fixes:
""".trimIndent()
  )

  println("parsed demo: $demo")
}


object DemoGrammar : Grammar<Demo>(debugMode = true) {
  private val newline by regexToken("""\n|\r\n|\r""")
  private val text by regexToken(""".+""")

  private val versionTag by literalToken("Version: ")
  private val version by -versionTag * text

  private val categoryNameIndent by literalToken("  ")
  private val categoryNameSuffix by literalToken(":")
  private val categoryName by -categoryNameIndent * text * -categoryNameSuffix

  private val categorySection: Parser<String> by parser {
    println("parsing CategorySection")
    val name = categoryName().text
    println("  name: $name")
    println(newline())
    name
  }

  override val root: Parser<Demo> by parser {
    val version = version().text
    println("  version:$version")
    val categories = repeatZeroOrMore(categorySection)
    repeatZeroOrMore(newline)
    Demo(
      version = version,
      categories = categories,
    )
  }
}

Output:

  version:1.2.3
parsing CategorySection
parsed demo: MismatchedToken(expected=Token(EOF), found=TokenMatch(token=LiteralToken('  '), offset=15, length=2))

Suggestion: Sort literal tokens by literal length

With the current behavior, the following code throws an unexpected EOF error:

val testGrammar = object : Grammar<TokenMatch>() {
    val single by literalToken("<")
    val double by literalToken("<<")
    
    override val root = single or double
}

testGrammar.parse("<<")

In this example, the second token is always unreachable. The user can easily solve this by swapping the definitions of single and double, since the lexer currently goes through the literal tokens in the order they are defined:

val testGrammar = object : Grammar<TokenMatch>() {
    val double by literalToken("<<")
    val single by literalToken("<")
    
    override val root = single or double
}

testGrammar.parse("<<")

My proposal is to sort the literal tokens by length before attempting to match them. When the lexer attempts longer tokens first, this always leads to the correct result, and the user can no longer define literal tokens that are unreachable during lexing.

I'm not sure you intended for users to be able to control the match order via the order of the definitions. One negative aspect of this suggestion is possible performance overhead, since users could no longer put more frequently occurring tokens at the top of their grammar to speed up matching. A compromise could be a more elaborate algorithm that moves unreachable tokens in front of their reachable counterparts when the lexer is initialized. This still lets people maintain control over most of the match order and only eliminates the unreachable tokens.

The ultimate decision is not straightforward, but I personally think the usability benefits for people new to the library outweigh the drawbacks for experienced users.

Wildcard regex token after literal token does not work

I read about parsus a while ago and wanted to incorporate it into a multi-platform side-project of mine.
Sadly, I encountered the following behavior. I'm unsure whether I'm using the library wrong or I've encountered a bug, so any help is appreciated.

Basically, I want to parse a string consisting of a limited character set, followed by a literal : and then ending in arbitrary text. My actual use case is a little more advanced, but this is the minimal subset that reproduces the problem.

object ProceduralExampleGrammar : Grammar<Pair<String, String>>() {
    private val firstRegex by regexToken("""[A-Za-z0-9-]+""")
    private val literal by literalToken(":")
    private val secondRegex by regexToken(""".+""")

    override val root by parser {
        val first = firstRegex()
        literal()
        val second = secondRegex()
        Pair(first.text, second.text)
    }
}

fun main() {
    val parseResult = ProceduralExampleGrammar.parse("FOO:BAR")
    val pair = parseResult.getOrThrow()
}

The same behavior can be observed when using the combinator syntax.

object CombinatorExampleGrammar : Grammar<Pair<String, String>>() {
    private val firstRegex by regexToken("""[A-Za-z0-9-]+""")
    private val literal by literalToken(":")
    private val secondRegex by regexToken(""".+""")

    override val root by firstRegex * -literal * secondRegex map { (first, second) ->
        Pair(first.text, second.text)
    }
}

fun main() {
    val parseResult = CombinatorExampleGrammar.parse("FOO:BAR")
    val pair = parseResult.getOrThrow()
}

Am I using parsus wrong? Or have I stumbled upon a bug?

Case insensitive matching

I'm trying to write a parser for the Factorio changelog format, which is very particular and easy to get wrong. For example, the date field is optional and must be capitalized as Date. However, it's easy to make a mistake and use the wrong capitalization, e.g. date or DATE.

I'd like to be able to use Parsus to capture the values leniently, so that I can later do validation.

Literal tokens

Adding an ignoreCase parameter to LiteralToken would be helpful, and easy to implement.

class LiteralToken(
  private val string: String,
  private val ignoreCase: Boolean = false,
  name: String? = null,
  ignored: Boolean = false
) : Token(name, ignored) {

  override fun match(input: CharSequence, fromIndex: Int): Int =
    when {
      input.startsWith(string, fromIndex, ignoreCase = ignoreCase) -> string.length
      else                                                         -> 0
    }
}

Regex token

Kotlin's Regex has an option to ignore the case
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/-regex-option/-i-g-n-o-r-e_-c-a-s-e.html

What do you think about adding a parameter to set the regex options in the regexToken() function? For example:

private val datePrefix by regexToken("""date:\s*""", options = setOf(RegexOption.IGNORE_CASE))

Although perhaps implementing a custom builder might be more succinct:

private val datePrefix by regexToken("""date:\s*""") {
 options += RegexOption.IGNORE_CASE
}

Or maybe just a boolean flag, for simplicity

private val datePrefix by regexToken("""date:\s*""", ignoreCase = true)

Feature Request: Partial Parsing

Context: I'm porting an existing handwritten parser to use parsus.

Right now, parsus only supports fully parsing a string. If the string doesn't completely parse, it returns an error with the index of how far it was able to parse.

It would be very helpful if it also returned what it had successfully parsed up until that point.

Because it doesn't have that feature, I've resorted to this hack:

abstract class PartialGrammar<T> : Grammar<T>() {
    // ... the grammar I've ported over

    abstract val bind: Parser<T>
    private val any by token { input, fromIndex -> input.length - fromIndex }
    val remain by parser {
        val v = bind() to currentOffset
        while (true) {
            when (val t = currentToken?.token) {
                EofToken, null -> break
                else -> skip(t)
            }
        }
        v
    }
}

Feature Request: Composing Grammars

Context: I'm porting an existing handwritten parser to use parsus.

Right now, parsus is built on the presumption that a Grammar is complete. It would be helpful if Grammars could be composed.

Because it doesn't support this natively, I've resorted to this hack:

internal abstract class C<T> : Grammar<T>() {
   // ... grammar
}

val extendedGrammar = object : C<Expression>() {
    // Additional grammar
    override val root by parser_in_C_class
}

Feature: Use Collections instead of Tuples at the extremes

Context: I'm porting an existing handwritten parser to use parsus.

Right now, parsus implements its own tuple types. Presumably out of a sheer appeal to sanity, their composition is capped at seven items. This restriction wasn't obvious and was frustrating when composing parsers.

It would be nice if parsus, instead, overflowed into using an Array or List type. Kotlin supports List destructuring, so (I lightly suspect) it wouldn't impact the ease of Grammar design.

Publish the library to maven central

The library looks pretty interesting. Would you please publish it on Maven Central, as Bintray is mostly blocked in many corporate environments?

ParseException: Unmatched token at offset=20, when expected: Token(EOF)

Given the following grammar:

    data class MappingEntry(val destinationStart: Long, val sourceStart: Long, val length: Long) {
        private val sourceRange: LongRange = sourceStart..(sourceStart + length)
        private val destinationRange: LongRange = destinationStart..(destinationStart + length)

        fun lookup(source: Long): Long? =
                if (source in sourceRange) destinationStart + (source - sourceStart) else null

        fun reverseLookup(destination: Long): Long? =
                if (destination in destinationRange) sourceStart + (destination - destinationStart) else null

    }

    data class Mapping(val entries: List<MappingEntry>) {
        fun lookup(source: Long): Long = entries.firstNotNullOfOrNull { it.lookup(source) } ?: source

        fun reverseLookup(destination: Long): Long =
                entries.firstNotNullOfOrNull { it.reverseLookup(destination) } ?: destination

    }

    val parser = object : Grammar<Pair<List<Long>, List<Mapping>>>(debugMode = true) {
        val num by regexToken("\\d+")
        val nl by literalToken("\n", "newline")
        val sp by regexToken(" +", "space")
        val colon by literalToken(":")
        val seedLit by literalToken("seeds")
        val toLit by literalToken("-to-")
        val mapLit by literalToken("map")
        val word by regexToken("[a-zA-Z]+")
        val range by num * -sp * num * -sp * num map {
            MappingEntry(it.t1.text.toLong(), it.t2.text.toLong(), it.t3.text.toLong())
        }
        val ranges by separated(range, nl)
        val mappingName by word * -toLit * word * -sp * -mapLit * -colon * -nl
        val mapping by -mappingName * ranges * -nl map { Mapping(it) }
        val mappings by separated(mapping, nl)
        val seeds by -seedLit * -colon * -sp * separated(num, sp) * -nl * -nl map { it.map { it.text.toLong() } }

        override val root by seeds * mappings map { it.t1 to it.t2 }
    }

and the following input:

seeds: 79 14 55 13

seed-to-soil map:
50 98 2
52 50 48

soil-to-fertilizer map:
0 15 37
37 52 2
39 0 15

fertilizer-to-water map:
49 53 8
0 11 42
42 0 7
57 7 4

water-to-light map:
88 18 7
18 25 70

light-to-temperature map:
45 77 23
81 45 19
68 64 13

temperature-to-humidity map:
0 69 1
1 0 69

humidity-to-location map:
60 56 37
56 93 4

I get the following exception:

Exception in thread "main" ParseException: Unmatched token at offset=20, when expected: Token(EOF)
	at me.alllex.parsus.parser.ParseResultKt.getOrThrow(ParseResult.kt:137)
	at me.alllex.parsus.parser.Grammar.parseOrThrow(Grammar.kt:73)
	at Day05Kt.main(Day05.kt:76)
	at Day05Kt.main(Day05.kt)

But the code and input seem legitimate. I would expect to see at least the expected token and the exact part of the text containing the error.

Optional token?

I'm trying to write a parser for a SemVer string that might or might not end in -SNAPSHOT.

How can I tell Parsus that the token is optional?

Currently I get an exception, because the SNAPSHOT token parser is mandatory.

import me.alllex.parsus.parser.*
import me.alllex.parsus.token.literalToken
import me.alllex.parsus.token.regexToken

fun main() {
  println(SemVerParser.parseEntireOrThrow("1.2.3-SNAPSHOT")) // Success ✅
  // prints: 1.2.3-SNAPSHOT
  println(SemVerParser.parseEntireOrThrow("1.2.3")) // ERROR ❌
  // Exception in thread "main" ParseException(MismatchedToken(expected=LiteralToken('-'), found=TokenMatch(token=Token(EOF), offset=5, length=1)))

}

private object SemVerParser : Grammar<SemVer>() {
  private val dotSeparator by literalToken(".")
  private val dashSeparator by literalToken("-")

  /** Non-negative number that is either 0, or does not start with 0 */
  private val number: Parser<Int> by regexToken("""0|[1-9]\d*""").map { it.text.toInt() }

  private val snapshot: Parser<Boolean> by -dashSeparator * literalToken("""SNAPSHOT""")
    .map { it.text == "SNAPSHOT" }

  override val root: Parser<SemVer> by parser {
    val major = number()
    dotSeparator()
    val minor = number()
    dotSeparator()
    val patch = number()
    val snapshot = snapshot() // how can I make this optional?
    SemVer(
      major = major,
      minor = minor,
      patch = patch,
      snapshot = snapshot,
    )
  }
}

private data class SemVer(
  val major: Int,
  val minor: Int,
  val patch: Int,
  val snapshot: Boolean,
) {
  override fun toString(): String =
    "$major.$minor.$patch" + if (snapshot) "-SNAPSHOT" else ""
}

Add `parseEntireOrNull()` function

I'd like to be able to parse and receive null if the parsing fails.

  • This would be useful because I could parse a statement and, if it fails, throw a more specific exception. Example:

    val statement = "a & (b1 -> c1) | a1 & !b | !(a1 -> a2) -> a"
    val ast = booleanGrammar.parseEntireOrNull(statement)
      ?: error("failed to parse boolean statement '$statement'")
  • An *OrNull() variant would match the Kotlin standard library's naming pattern, which uses *OrNull() for parsing numbers, getting elements from lists, and more.

Remove parseEntireOrThrow()?

I'd even go further and completely remove the parseEntireOrThrow() function.

  • I think it would be more idiomatic to let users decide whether to throw, for example via the elvis operator, because it would encourage more detailed error messages.

  • Introducing try/catch into code that relies on coroutines can be risky, because there's a chance users will catch a general Exception, which can disrupt coroutine cancellation if handled improperly. (Coroutines are cancelled via a CancellationException, so if it is swallowed, cancellation won't propagate.)

    Returning null would also allow handling error scenarios without needing to worry about try/catch:

    val statement = "help"
    val ast = booleanGrammar.parseEntireOrNull(statement)
    if (ast == null) {
      val cmd = commandGrammar.parseEntireOrNull(statement)
      handleUnexpectedCommand(cmd)
      return
    }
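A hedged sketch of the proposed *OrNull shape, using only the standard library: `parseOrNull` below is a hypothetical adapter (not a Parsus API) that converts a throwing parse into a null-returning one, while rethrowing CancellationException so coroutine cancellation still propagates.

```kotlin
import kotlin.coroutines.cancellation.CancellationException

// Hypothetical adapter: wrap any throwing parse function into an *OrNull variant.
// CancellationException is rethrown first, so cancellation is never swallowed.
fun <T> parseOrNull(input: String, parse: (String) -> T): T? =
    try {
        parse(input)
    } catch (e: CancellationException) {
        throw e // never swallow cancellation
    } catch (e: Exception) {
        null
    }

fun main() {
    // A stand-in for a grammar's throwing entry point:
    val parseBool: (String) -> Boolean = { s ->
        s.toBooleanStrictOrNull() ?: throw IllegalArgumentException("not a boolean: $s")
    }

    val ok = parseOrNull("true", parseBool)
    val fallback = parseOrNull("help", parseBool)
        ?: "fallback" // elvis-style error handling, as proposed above
    println(ok)       // true
    println(fallback) // fallback
}
```

This keeps the throwing entry point as the single source of failure while giving callers the null-based ergonomics described above.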

Document parseEntireOrThrow() with @Throws

Because I'm proposing removing parseEntireOrThrow(), I thought I'd put this issue here rather than open a new one.

I'm using Parsus in a .kts script, and unfortunately IntelliJ doesn't resolve dependency sources very well, so I can't see the documentation that says which exception parseEntireOrThrow() can throw.


Adding @Throws would help JVM consumers.
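To illustrate, here is a minimal, self-contained sketch of the proposal. The ToyGrammar and ParseException classes below are local stand-ins defined only for this example (the ParseException name comes from the stack trace above; the real Parsus signatures may differ). On the JVM, @Throws adds a checked throws clause to the compiled method, which Java callers and documentation tooling can then see.

```kotlin
// Stand-in types for the sketch; not the real Parsus API.
class ParseException(message: String) : Exception(message)

class ToyGrammar {
    /** Parses the entire input or fails with [ParseException]. */
    @Throws(ParseException::class) // surfaces in Java interop and quick docs
    fun parseEntireOrThrow(input: String): String {
        if (input.isBlank()) throw ParseException("cannot parse blank input")
        return input.trim()
    }
}

fun main() {
    println(ToyGrammar().parseEntireOrThrow("  1.2.3  ")) // 1.2.3
}
```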

Adding Parsus as a dependency sometimes causes error: "Cannot access 'Language': it is internal in 'org.intellij.lang.annotations'"

Unfortunately, the @Language injection workaround doesn't play well with regular projects. The org.intellij.lang.annotations.Language annotation defined by Parsus shadows the official one provided by java-annotations, and because Parsus' copy is marked internal, it causes compilation errors in the project whenever Parsus is present.

The quickest way to reproduce this is in a Gradle build script, although I also encounter the same problem in src/main/kotlin in regular projects:

// build.gradle.kts
import org.intellij.lang.annotations.Language

buildscript {
  repositories {
    mavenCentral()
    gradlePluginPortal()
  }
  dependencies {
    classpath("me.alllex.parsus:parsus:0.6.1")
  }
}

@Language("text")
val foo = "..."

The result is that whenever I have Parsus as a dependency, my project cannot use @Language at all. And I can't think of a workaround.

Suggestions

I have two suggestions:

  1. Change the visibility of Parsus' @Language from internal to public.
  2. java-annotations has been converted to KMP in v25.0.0, although that version is yet to be released. It will provide a valid multiplatform @Language. So, for now, remove Parsus' @Language, and switch to java-annotations once v25.0.0 is released.
