hylo-lang / lotsawa Goto Github PK

View Code? Open in Web Editor NEW

18.0 5.0 2.0 1.95 MB

A Swift implementation of the MARPA algorithms

License: Apache License 2.0

Swift 99.97% Dockerfile 0.03%

lotsawa's Introduction

Lotsawa

A parsing library and tool with the essential features of MARPA, encoded in pure Swift.

In particular, like MARPA, Lotsawa:

Parses any LR-regular grammar in linear time.
Parses some non-LR-regular grammars in linear time.
Produces the parse forests for any CFG.

Lotsawa owes almost everything of value to MARPA and its author, Jeffrey Kegler, for uncovering the thread of progress in parsing technology, gathering it together into one group of algorithms, proving important properties about them, and contributing some key innovations. This project exists primarily because MARPA is missing functionality needed by the Hylo language implementation.

Secondary reasons Lotsawa might be useful:

Lotsawa supports high-level usage from a safe, statically-typed language that compiles to efficient native code.
Lotsawa has a simple build/installation process and no dependencies other than Swift.
Lotsawa encodes the grammar analysis and recognition algorithms with a relatively small amount of high-level code; it may serve as a better reference for understanding the technology than either the highly theoretical Marpa paper or from libmarpa's C implementation, which must be extracted from a CWeb document.
Lotsawa can be used to precompile a grammar into static tables, eliminating some initial startup cost.
Lotsawa uses pure Mutable Value Semantics, thus eliminating many possible sources of bugs, including data races.

lotsawa's People

Contributors

Stargazers

${ @MainActor _ :  in \M3.Ultra }(Swift) avatar$

Watchers

Forkers

dan-zheng mrcodechef

lotsawa's Issues

Use "predecessor" consistently

We're using “prefix” to mean the predecessor of an Earley item in several places. Let's clean that up?

Document and test representational limits of `Chart.Item`

Since we're doing bit-packing, let's make sure we have well-documented preconditions and that we test the edge cases.

Consider whether isLeoUnique (and isPenultUnique) is needed

If uniqueness wasn't guaranteed by other features of the algorithm, it's hard to imagine that the Leo optimization would actually work for all grammars and inputs. Something seems a little bit fishy about this test, but I need to better understand the optimization before I can say.

Vocabulary in recognizer code

I've given the recognizer code a couple of readings, and two comments:

I take it the PartialParses object is more or less what's called an "Earley table" in the literature. Analogous with what I suggested for PartialParse, I'd suggest the name be changed to make it easier for a reader to see the connection to the Earley literature.

"Earleme" seems to be used as a synonym for "Earley set". In my usage it refers to locations, and differs from "Earley set ID" or "Earley set ordinal" only when variable length tokens are in use, and there are locations without Earley sets. If you are not currently implementing that feature, I would not use the term "earleme" at all, to avoid confusion with the concept in my writings.

I can see how the confusion might arise, because you'll see me use phrases like "look for a prediction at an earleme" all the time in the Marpa literature, where the meaning is very close to "look for a prediction in an Earley set". And perhaps I used the term loosely at points. But, analogous to "Earley item" and "Earley table", best to track the terms in the literature, when all else is equal.

Take a look at JS implementation

I just found a JavaScript implementation, https://github.com/aredridel/lotsawa, which may be of some interest.

/cc @jeffreykegler

I'm

Consider more efficient storage for Leo items

We could store them along with Earley items (PartialParses) in a single array in various ways:

Add storage for a symbol (perhaps the postdot symbol) to Earley items to make them the same size as Leo items. Burn a bit to indicate Leo or Earley.
Store the transition symbol of a Leo item separately from its PartialParse

Add a function to test a claim of LR-Regularity

It would be useful to be able to validate that a grammar is LR-regular and thus can be parsed in linear time. It would be useful to diagnose the causes of non-LR-regularity.

First-class representation of Item/ItemID

First of all Chart.Item should be renamed Chart.ItemID, since multiple entries with the same ID
actually constitute an item.

Then, perhaps Entry should be renamed Cause or Causation or Derivation.

Then Item should be a first-class type wrapping a Chart.Entries.SubSequence, and having a sequence of Derivations or at least MainstemIndices.

Then there should be a first-class representation for the sequence of Items in an Earleme having a given
transition symbol.

All of this is currently ad-hoc in the type system.

Move TestRecognizer chart printing out of TestGrammar

It really wants first-class access to chart items.

Consider eliminating nulling symbols from the transformed grammar

It's an optimization, but MARPA's implementation does this, and simply reconstructs them during tree traversal.

License Question

Hello,

I am interested in using Lotsawa for parsing a custom metalanguage.

May I ask what license this project has? I know the val-compiler is Apache-2.0, but I didn't know if a similar license applies for Lotsawa. I am following the advice in this post.

Consider more efficient lookup of partial parses to advance

Auxilliary hashtables probably only become efficient when there are many partial parses active in each Earleme, but may be necessary to guarantee a strict O(N) upper bound on parsing.

Also possible: sort the elements of an Earleme so we can use binary search. Likely also only a win when there are many partial parses in each Earleme. Slightly complicated by the fact that the current earleme can't be sorted (though #6 would make that a non-issue).

Small linear searches are extremely hard to beat—this suggests that the break-even point for int-sized data is around 60 elements, and this suggests 22 elements for int triples.

Probably Aycock&Horspool optimization into a pushdown automaton would reduce the number of partial parses in each Earleme, so in the end this may never become a real win for any practical grammar.

Need to measure, obviously