Giter VIP home page Giter VIP logo

lotsawa's Introduction

Lotsawa

A parsing library and tool with the essential features of MARPA, encoded in pure Swift.

In particular, like MARPA, Lotsawa:

  • Parses any LR-regular grammar in linear time.
  • Parses some non-LR-regular grammars in linear time.
  • Produces the parse forests for any CFG.

Lotsawa owes almost everything of value to MARPA and its author, Jeffrey Kegler, for uncovering the thread of progress in parsing technology, gathering it together into one group of algorithms, proving important properties about them, and contributing some key innovations. This project exists primarily because MARPA is missing functionality needed by the Hylo language implementation.

Secondary reasons Lotsawa might be useful:

  • Lotsawa supports high-level usage from a safe, statically-typed language that compiles to efficient native code.
  • Lotsawa has a simple build/installation process and no dependencies other than Swift.
  • Lotsawa encodes the grammar analysis and recognition algorithms with a relatively small amount of high-level code; it may serve as a better reference for understanding the technology than either the highly theoretical Marpa paper or from libmarpa's C implementation, which must be extracted from a CWeb document.
  • Lotsawa can be used to precompile a grammar into static tables, eliminating some initial startup cost.
  • Lotsawa uses pure Mutable Value Semantics, thus eliminating many possible sources of bugs, including data races.

lotsawa's People

Contributors

dabrahams avatar nickpdemarco avatar dependabot[bot] avatar dan-zheng avatar

Stargazers

Garrett van Wageningen avatar Ali A. Hilal avatar MetaSky avatar Jonathan Cooper avatar Parham Mohammadi avatar Chris avatar Gregory Petrosyan avatar Deyan Ginev avatar Jeffrey Kegler avatar adamz avatar Antonio De Lucreziis avatar { @MainActor _ :  in \M3.Ultra }(Swift) avatar 文宇祥 avatar Nikita avatar joao guilherme avatar  avatar Fan Jiang avatar Tim Kersey avatar

Watchers

 avatar Jeffrey Kegler avatar Dimi Racordon avatar  avatar Oscar Bender-Stone avatar

lotsawa's Issues

Consider whether isLeoUnique (and isPenultUnique) is needed

If uniqueness wasn't guaranteed by other features of the algorithm, it's hard to imagine that the Leo optimization would actually work for all grammars and inputs. Something seems a little bit fishy about this test, but I need to better understand the optimization before I can say.

Vocabulary in recognizer code

I've given the recognizer code a couple of readings, and two comments:

I take it the PartialParses object is more or less what's called an "Earley table" in the literature. Analogous with what I suggested for PartialParse, I'd suggest the name be changed to make it easier for a reader to see the connection to the Earley literature.

"Earleme" seems to be used as a synonym for "Earley set". In my usage it refers to locations, and differs from "Earley set ID" or "Earley set ordinal" only when variable length tokens are in use, and there are locations without Earley sets. If you are not currently implementing that feature, I would not use the term "earleme" at all, to avoid confusion with the concept in my writings.

I can see how the confusion might arise, because you'll see me use phrases like "look for a prediction at an earleme" all the time in the Marpa literature, where the meaning is very close to "look for a prediction in an Earley set". And perhaps I used the term loosely at points. But, analogous to "Earley item" and "Earley table", best to track the terms in the literature, when all else is equal.

Consider more efficient storage for Leo items

We could store them along with Earley items (PartialParses) in a single array in various ways:

  1. Add storage for a symbol (perhaps the postdot symbol) to Earley items to make them the same size as Leo items. Burn a bit to indicate Leo or Earley.
  2. Store the transition symbol of a Leo item separately from its PartialParse

First-class representation of Item/ItemID

First of all Chart.Item should be renamed Chart.ItemID, since multiple entries with the same ID
actually constitute an item.

Then, perhaps Entry should be renamed Cause or Causation or Derivation.

Then Item should be a first-class type wrapping a Chart.Entries.SubSequence, and having a sequence of Derivations or at least MainstemIndices.

Then there should be a first-class representation for the sequence of Items in an Earleme having a given
transition symbol.

All of this is currently ad-hoc in the type system.

License Question

Hello,

I am interested in using Lotsawa for parsing a custom metalanguage.

May I ask what license this project has? I know the val-compiler is Apache-2.0, but I didn't know if a similar license applies for Lotsawa. I am following the advice in this post.

Consider more efficient lookup of partial parses to advance

Auxilliary hashtables probably only become efficient when there are many partial parses active in each Earleme, but may be necessary to guarantee a strict O(N) upper bound on parsing.

Also possible: sort the elements of an Earleme so we can use binary search. Likely also only a win when there are many partial parses in each Earleme. Slightly complicated by the fact that the current earleme can't be sorted (though #6 would make that a non-issue).

Small linear searches are extremely hard to beat—this suggests that the break-even point for int-sized data is around 60 elements, and this suggests 22 elements for int triples.

Probably Aycock&Horspool optimization into a pushdown automaton would reduce the number of partial parses in each Earleme, so in the end this may never become a real win for any practical grammar.

Need to measure, obviously

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.