carocad / parcera

Grammar-based Clojure(script) parser

License: GNU Lesser General Public License v3.0

Clojure 81.36% ANTLR 15.97% JavaScript 2.68%

clojure parser grammar ast antlr4 reader

parcera's People

pez mauricioszabo linkning severeoverfl0w ahmedmrefaat noahtheduke

parcera's Issues

Fails on chained tagged literals

This is perfectly valid code, but parcera cannot handle it:

(parcera/ast "#a #b 1")
;; =>
(:code
 (:tag (:symbol "a") (:whitespace " ") (:parcera.core/failure "#") (:symbol "b"))
 (:whitespace " ")
 (:number "1"))

Octal characters

AFAICT, octal characters (not numbers -- e.g. \o013) do not appear to be used very widely (at least not based on my sampling of code on clojars).

Here is a real-world example:

https://github.com/clj-commons/camel-snake-kebab/blob/692c78fcba90c61f1c17d7f18e50d31bb0ada4d6/src/camel_snake_kebab/internals/string_separator.cljc#L37

IIUC, one can have one, two, or three octal "digits" after the \o, but when there are three digits, the left-most digit should be less than or equal to 3. So for example, these work:

\o1
\o77
\o377

But this doesn't:

\o400

If it seems worth adding support for this, perhaps it is not worth constraining the value of the left-most digit when parsing the three-digit case.

FWIW, I think some relevant code in clojure's source is:

https://github.com/clojure/clojure/blob/30a36cbe0ef936e57ddba238b7fa6d58ee1cbdce/src/jvm/clojure/lang/LispReader.java#L1218-L1227
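
The rule described above (one to three octal digits after \o, with a value no greater than octal 377) can be sketched as a small predicate. This is only an illustration, not parcera's grammar or the reader's own code; the name octal-char? is hypothetical.

```clojure
(defn octal-char?
  "Hypothetical helper: true when s is an octal character literal such as \\o377."
  [s]
  (boolean
    ;; \o followed by one to three octal digits
    (when-let [[_ digits] (re-matches #"\\o([0-7]{1,3})" s)]
      ;; the value must stay within the octal range [0, 377]
      (<= (Integer/parseInt digits 8) 0377))))

(octal-char? "\\o1")   ;; => true
(octal-char? "\\o77")  ;; => true
(octal-char? "\\o377") ;; => true
(octal-char? "\\o400") ;; => false
```

Checking the value after matching keeps the pattern simple, which is in line with the suggestion of not constraining the left-most digit in the grammar itself.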

consider migrating to deps.edn and plain cljs.main instead of figwheel

Hopefully this would make the setup of the project a bit simpler (less tooling required).

https://oli.me.uk/clojure-and-clojurescript-tests-on-travis/

refactor test suite to avoid ambiguity and guarantee correctness

Currently we check that parcera can do a roundtrip for any input string. The problem with this approach is that it doesn't take ambiguity into account.

For example:
~@hello could be parsed as:

symbol
unquote + @hello (symbol)
unquote-splicing + symbol

Right now parcera doesn't check that even though it should.

The other problem is that parcera only checks the "accepted" parsed AST. It doesn't check that no other interpretations are available, so it cannot guarantee that it will always return the same value.

redundant whitespace nodes

It seems to me that the [:whitespace ""] nodes aren't necessary:

user=> (parcera/clojure "{:a 1}")
[:code [:whitespace ""] [:map [:map-content [:whitespace ""] [:simple-keyword "a"] [:whitespace " "] [:whitespace ""] [:number "1"] [:whitespace ""]]] [:whitespace ""]]

Why is the result wrapped in a [:code ...] vector? Can it be something else than code?
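
In the meantime, a consumer could drop the empty nodes itself. The following is a minimal sketch that assumes the hiccup-like vectors shown above; strip-empty-whitespace is a hypothetical helper and not part of parcera.

```clojure
(require '[clojure.walk :as walk])

(defn strip-empty-whitespace
  "Hypothetical helper: removes every [:whitespace \"\"] child from a hiccup-like tree."
  [tree]
  (walk/postwalk
    (fn [node]
      ;; only touch nodes shaped like [:tag child ...]
      (if (and (vector? node) (keyword? (first node)))
        (into [] (remove #(= [:whitespace ""] %)) node)
        node))
    tree))

(strip-empty-whitespace
  [:code [:whitespace ""]
   [:map [:map-content [:whitespace ""] [:simple-keyword "a"]
          [:whitespace " "] [:whitespace ""] [:number "1"] [:whitespace ""]]]
   [:whitespace ""]])
;; => [:code [:map [:map-content [:simple-keyword "a"] [:whitespace " "] [:number "1"]]]]
```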

consider changing the antlr4 tree transformation logic

Currently I use a nested lazy sequence transformation to turn the result from Antlr4 into a Clojure data structure.

This performs well since the transformation is lazy and the user only pays for what they consume. Unfortunately this transformation incurs some overhead (not sure of the exact cause ... maybe just the price of immutability? or the deeply nested stack?). Last time I checked, antlr4 was able to parse clojure.core in around 30ms, but the transformation adds another ~~200ms~~ 150ms to realise all sequences.

I see several options:

drop to Java collections and make the transformation eager. See (java.util.Collections/unmodifiableList (new java.util.ArrayList)) on a repl
avoid the deeply nested stack and see if that improves things

Things that can have metadata on them

The current list of things in the grammar that can have metadata on them is:

( symbol
| collection
| set
| namespaced_map
| tag
| fn
| unquote
| unquote_splicing

It appears that there are more possibilities. Additional examples for:

| conditional
| deref
| quote
| backtick
| var_quote

are respectively:

In this list, deref seems to be the most common, followed by quote, backtick, and conditional. I only encountered one example for var-quote.

The clojure.org metadata page currently says:

Note that metadata reader macros are applied at read-time, not at evaluation-time, and can only be used with values that support metadata, like symbols, vars, collections, sequences, namespaces, refs, atoms, agents, etc. Some important exceptions that don’t support metadata are strings, numbers, booleans, Java objects, keywords (these are cached and can be shared within the runtime), and deftypes (unless they explicitly implement clojure.lang.IMeta).

The "etc." part seems to be able to include a fair bit, as long as it implements clojure.lang.IMeta I guess, and IIUC this is not something that one can necessarily tell in all cases without running the code.

I haven't gone through the grammar exhaustively to check for other possibilities so I suppose there could be more...

consider making validation customisable

I was assuming both clojure and clojurescript to follow the reader reference implementation: https://clojure.org/reference/reader

However, it seems that there are differences (both intentional and accidental) in the way that the reader behaves.

I think that in order to avoid blowing my mind (and compromising parcera's future development) I should just let the user control how much validation they want to enforce. This could simply be a flag on the parcera/ast function.

Handling the sequence: number, discard, something

In clj I get:

user=> [1#_ 2]
[1]

In parcera I get:

parcera.core=> (ast "[1#_ 2]")
(:code (:vector (:symbol "1#_") (:whitespace " ") (:number "2")))

A possibly simpler case:

parcera.core=> (ast "1#_ 2")
(:code (:symbol "1#_") (:whitespace " ") (:number "2"))

and:

parcera.core=> (ast "1#_2")
(:code (:symbol "1#_2"))

This is with: 1c51356

Doesn't support conditional splicing with a space prefix

#?@() works, #?@ () fails.

#?@() => (:code (:conditional_splicing))
#?@ () => (:code (:tag (:symbol "?") (:parcera.core/failure "@") (:whitespace " ") (:list)))

consider using records inside instaparse for speed

Based on this presentation about Clojure performance: https://www.youtube.com/watch?v=3SSHjKT3ZmA&t=1345s

I think that replacing the Listener functions with records in the instaparse algorithm could yield a significant performance improvement.

References:

https://github.com/Engelberg/instaparse/blob/master/src/instaparse/gll.cljc#L554

consider using mutable data structures inside instaparse for speed

Based on this presentation about Clojure performance: https://www.youtube.com/watch?v=3SSHjKT3ZmA&t=1345s

I think that replacing the atoms in the Trampoline data structure with mutable data structures could yield a significant performance improvement.

References:

https://github.com/Engelberg/instaparse/blob/master/src/instaparse/gll.cljc#L240

consider making parcera available on Clojurescript

Currently parcera requires the StringBuilder class from Java in order to make the writing process fast (Clojure's str is too slow on large input).

One option mentioned by @borkdude is to use StringBuffer from the google library to support the same on Clojurescript.

Depending on the performance and added complexity I should consider this as a long term goal.

Open question:

How can I test this? ... my knowledge of cljs is quite limited 🤔

Parsing keywords starting with number blows up

(parcera/ast ":1") => clojure.lang.PersistentList cannot be cast to java.lang.CharSequence

This problem seems to be twofold:

parcera.core/failure expects that :simple_keyword will always have a String for its child, so it blows up if the keyword turns out to be invalid (the child is then not a string but a list with the failure info).
It is ambiguous whether this is a valid keyword or not (symbols must not start with a number, but a keyword must start with a :, so I guess the rest is fair game?). However, both the Clojure reader and the ClojureScript reader accept it.

Fails on conditional splicing inside map

e.g. {#?@()} is valid, but parcera fails with: (:code (:parcera.core/failure (:map (:conditional_splicing)))), message: :message "((:conditional_splicing)) - failed: (even? (count %)) spec: :parcera.spec/map\n"

Comment and metadata issue

It looks like if there is a comment right after some metadata the parse result may be off:

(pc/ast "^{:a true} ;; hello\n [:a]")
#_ 
(:code
 (:metadata
  (:metadata_entry
   (:map
    (:keyword ":a") (:whitespace " ")
    (:symbol "true")))
  (:whitespace " "))
 (:comment ";; hello") 
 (:whitespace "\n ")
 (:vector
  (:keyword ":a")))

Note that the comment, last whitespace, and vector are siblings of metadata and not children.

This is in contrast to:

(pc/ast "^{:a true} [:a]")
#_
(:code
 (:metadata
  (:metadata_entry
   (:map
    (:keyword ":a") (:whitespace " ")
    (:symbol "true")))
  (:whitespace " ")
  (:vector (:keyword ":a"))))

where vector is a child of metadata.

I encountered this in the wild: https://github.com/lspector/Clojush/blob/master/src/clojush/instructions/code.clj#L166-L176

Consider supporting legacy NaN Inf, etc.

Before Clojure added an official ##NaN, there were literals provided in tools.reader. The significance of this is that they were supported in ClojureScript. They seem to have stopped working with recent versions of the ClojureScript compiler, but there are still legacy codebases around in which NaN, -Inf, Inf are not being read correctly.

At the very least, it would seem that Inf and NaN should work as symbols.

consider switching output format to enlive

Right now only a hiccup-like output format is supported. It might be worth exposing an enlive format as well.

See instaparse: https://github.com/Engelberg/instaparse#output-format

consider splitting whitespace per line

I received some feedback that it would be better if parcera split whitespace per line, instead of emitting a single whitespace node that captures all newlines. I am still not sure what the use case behind it is, nor why it "should" live in parcera rather than in a helper function on the consumer side. However, I am documenting it here so that I remember it and can think about it a bit more.
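
For reference, the consumer-side helper mentioned above could be as small as the sketch below. It assumes a whitespace node shaped like (:whitespace "...") as in the ASTs quoted elsewhere in these issues; split-whitespace-per-line is a hypothetical name.

```clojure
(defn split-whitespace-per-line
  "Hypothetical helper: splits one whitespace node into one node per line.
  Each chunk keeps its trailing newline, so the concatenated text is unchanged."
  [[_ text]]
  (map (fn [chunk] (list :whitespace chunk))
       ;; either a run ending in a newline, or the trailing run without one
       (re-seq #"[^\n]*\n|[^\n]+" text)))

(split-whitespace-per-line '(:whitespace "\n  \n "))
;; => ((:whitespace "\n") (:whitespace "  \n") (:whitespace " "))
```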

consider limiting conditional input

Currently conditional allows anything inside the list:

conditional: '#?' whitespace* list;

which for all practical purposes is pretty much everything in a Lisp like Clojure. It might be better to limit it to something like (keyword ignore* form)+. However, that means that '(' and ')' would be defined on both list and conditional, which is not nice.

consider flattening down the macro parsing

Currently most macro parser rules look like macro: pattern whitespace? form. Although this works, it creates a "fake" hierarchy in the AST which is not immediately visible to a developer looking at the code instead of the AST.

This also creates weird situations when taking the discard macro into account (see #45 ).

The main problem that I see is that Clojure's LispReader is inherently stateful. Macros are used as flags to modify the behavior of the next form; however, understanding what the next form is becomes incredibly difficult for things like reader conditionals (see #47 ) and discard macros. There is also no guarantee that Clojure won't have more of those cases in the future.

One possible way to tackle this is to just mimic the way that the LispReader works, such that macros become flags on a sequence of forms like [:list [:discard] [:quote] [:list ....]].

Although that would certainly tackle the issues above it would make validation of input much more difficult since the information required to validate a form is no longer contained only inside the form itself.

Doesn't support legacy metadata format

Before ^ was special, there was #^. The reader still works perfectly with this. See this example in the Clojure tests: https://github.com/clojure/clojure/blob/28b87d53909774af28f9f9ba6dfa2d4b94194a57/test/clojure/test_clojure/annotations/java_8.clj#L9-L16

user=> (meta #^{:a 10} {})
{:a 10}

#! interpreted as tag, not comment

#! means comment. This is used to allow for executable clojure scripts with a shebang.

consider using datafy for transformations from antlr structures

See https://clojure.github.io/clojure/branch-master/clojure.datafy-api.html

It might be better than having custom protocols to map between antlr and clojure as we currently do

consider creating an 'error' rule

There are some patterns that are too difficult to match correctly. For example:

a symbol cannot start with a number
a symbol cannot be followed by another symbol like hello/world/ => [:symbol "hello/world"] [:symbol "/"]
a macro keyword cannot be ::/

Most of these issues are due to Antlr4's lack of built-in lookahead functionality. However, antlr is what I have, so I need to make the best of it. I think it should be possible to make a parser rule to match those extra cases and pass them on to the user to handle.

Some numbers that start with two or more zeros and end in 'M'

I noticed in clj that:

user=> (type (read-string "0M"))
java.math.BigDecimal
user=> (type (read-string "00M"))
java.math.BigDecimal
user=> (type (read-string "000M"))
java.math.BigDecimal
user=> (type (read-string "0000M"))
java.math.BigDecimal
user=> 0000M
0M

With the master and fix/names branches I get:

user=> (require '[parcera.core :as pc])
nil
user=> (pc/ast "0M")
(:code (:number "0M"))
user=> (pc/ast "00M")
(:code (:parcera.core/failure "00M"))
user=> (pc/ast "000M")
(:code (:parcera.core/failure "000M"))
user=> (pc/ast "0000M")
(:code (:parcera.core/failure "0000M"))

So it looks like a problem when there are two or more leading zeros?
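
For comparison, a pattern that tolerates leading zeros before the M suffix could look like the sketch below. It is an assumption modelled on the reader behaviour shown in the transcript above, not a copy of parcera's grammar; big-decimal-literal? is a hypothetical name.

```clojure
(def big-decimal-pattern
  ;; optional sign, one or more digits (leading zeros allowed),
  ;; optional fraction, optional exponent, mandatory M suffix
  #"[-+]?[0-9]+(\.[0-9]*)?([eE][-+]?[0-9]+)?M")

(defn big-decimal-literal?
  "Hypothetical helper: true when s reads like a BigDecimal literal."
  [s]
  (boolean (re-matches big-decimal-pattern s)))

(big-decimal-literal? "0M")    ;; => true
(big-decimal-literal? "0000M") ;; => true
(big-decimal-literal? "00")    ;; => false, no M suffix
```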

Tagged literal where literal is a set or some other forms

With master (83cd988) and fix-names (72abf7e) for a map in a tagged literal, I get:

parcera.core=> (ast "#mytag {}")
(:code 
  (:tag
    (:symbol "mytag") (:whitespace " ") 
    (:map)))

That looks good.

However, for a set, I get:

parcera.core=> (ast "#mytag #{}")
(:code 
  (:tag 
    (:symbol "mytag") (:whitespace " ")) 
  (:set))

I may be mistaken, but IIUC, one aspect of tagged literals is that one can arrange to have a function operate upon whatever was tagged to produce an arbitrary value.

On the "Reader" page at clojure.org, there is this text:

...parse the form following the tag.

via: https://clojure.org/reference/reader#tagged_literals

It's not clear to me if there are any restrictions as to what the form can be. I had assumed that pretty much anything can be made to work.

Another item in eval: conditional?

I understand that #= is not encouraged, but FWIW:

user=> #=#?(:clj (+ 1 2) :cljr (+ 1 1))
3

The current rule (parcera/src/Clojure.g4, line 146 in 1c51356):

eval: '#=' ignore* (symbol | list);

does not list conditional.

consider splitting lexer and parser

It seems that Antlr4 has some extra functionality when the lexer and the parser are separated. One recurring problem with the current grammar is that the end of a token can be the start of the next. I am currently facing this issue in #78 where \o387 is not valid octal but it is interpreted as octal + number \o3 87 which is valid input.

Maybe if I could split these it would be possible to better differentiate between tokens

Radix numbers and optional 'N' suffix

Radix numbers do not appear to take any optional 'N' (or 'M') suffix:

user=> 2r01N
Syntax error reading source at (REPL:11:0).
For input string: "1N"
user=> 2r01M
Syntax error reading source at (REPL:12:0).
For input string: "1M"

The current grammar (parcera/src/Clojure.g4, lines 209 to 210 in 1c51356) appears to allow an optional 'N' suffix:

 | [rR][0-9a-zA-Z]+
 )? 'N'?;

I confirmed the following with 1c51356:

user=> (require '[parcera.core :as pc])
nil
user=> (pc/ast "2r01N")
(:code (:number "2r01N"))
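
One way to picture the reader's behaviour is that everything after the r is handed to a radix-aware integer parser, so a trailing N is just another (invalid) digit. A minimal sketch under that assumption follows; radix-number? is a hypothetical helper and not part of parcera.

```clojure
(defn radix-number?
  "Hypothetical helper: true when s is a radix literal whose digits are all
  valid for the stated radix. No N or M suffix is accepted."
  [s]
  (if-let [[_ radix digits] (re-matches #"[-+]?([1-9][0-9]?)[rR]([0-9a-zA-Z]+)" s)]
    (try
      ;; BigInteger rejects digits outside the radix and radixes outside 2..36
      (BigInteger. ^String digits (Integer/parseInt radix))
      true
      (catch NumberFormatException _ false))
    false))

(radix-number? "2r01")  ;; => true
(radix-number? "2r01N") ;; => false
(radix-number? "2r01M") ;; => false
```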

divide and conquer

One of the performance notes from the author of instaparse is to feed the parser smaller chunks of text.

Although in general I am against writing the grammar twice, I think that I could re-use the literals and some macro rules such that I would only do "wild matching" for collections.

I could then pass each "collection" match to the clojure parser, which should ideally be faster than the current solution.

references:

https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md

Things that can be in metadata_entry

The current list of things that can be in metadata_entry is:

( map | symbol | string | keyword | macro_keyword )

Since reader conditionals can end up as one of these values, maybe it's worth adding that too?

Here is a real-world example: https://github.com/mikera/core.matrix/blob/develop/src/main/clojure/clojure/core/matrix/impl/persistent_vector.cljc#L64

I didn't find any examples for deprecated_metadata_entry, but maybe that makes sense if #^ was deprecated before reader conditionals (1.7).

Fails for some unicode characters

(parcera/ast "\\u000a") produces (:code (:character "\\u") (:parcera.core/failure (:symbol "000a"))) but should produce (:code (:character "\\u000a"))

Strangely, 000a-c don't work, but 000d-f do!
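
As a reference point, a \u literal needs exactly four hexadecimal digits. The sketch below encodes that, plus the assumption that surrogate code units (D800 to DFFF) are rejected the way the Clojure reader does; unicode-char? is a hypothetical name and not part of parcera.

```clojure
(defn unicode-char?
  "Hypothetical helper: true when s is a \\uXXXX character literal."
  [s]
  (boolean
    ;; \u followed by exactly four hexadecimal digits
    (when-let [[_ hex] (re-matches #"\\u([0-9a-fA-F]{4})" s)]
      (let [code (Integer/parseInt hex 16)]
        ;; assumption: surrogate code units are not valid character literals
        (not (<= 0xD800 code 0xDFFF))))))

(unicode-char? "\\u000a") ;; => true
(unicode-char? "\\u000d") ;; => true
(unicode-char? "\\u00")   ;; => false
```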

Please consider adding a CHANGELOG

Hello, I bumped parcera in an old project from 0.3.1 to 0.11.5, and nothing worked anymore.
It would be great if you could add a CHANGELOG so one could check if it’s safe to bump the version or if there are breaking changes to deal with.

Thanks!

Make parcera portable again

v0.4.0 of parcera dropped support for Clojurescript.

I think that this would be a great addition to parcera; however, I lack the knowledge of Clojurescript setup to make this work.

I got a piece of it working on this branch: https://github.com/carocad/parcera/tree/antlr-js

However, the Closure compiler was incorrectly transforming the javascript files from antlr, so the import never really worked.

I tested the generated code from antlr directly on node.js and it worked (see index.js), but I didn't get it working on clojurescript :/

FYI: Made alc.x-as-tests which uses parcera

Thanks for parcera.

I used it to make alc.x-as-tests. At the moment it lets one use the content of suitably written comment blocks as tests:

(comment

  (+ 1 1)
  ;; => 2

)

The program will produce "derived" source by unwrapping comment blocks and rewriting certain sets of expressions as tests. The resulting source can then be executed to run the tests and receive a report.

Failure to parse some valid Clojure names

The Clojure reader accepts this code:

(defn ❤️ [x]
  (str "I ❤️ " x))

(❤️ "Clojure")

But Parcera (0.1.2) croaks on it.

These tests expose it:

(deftest simple
  (testing "character literals"
    (as-> "\\t" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "\\n" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "\\r" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "\\a" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "\\é" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "\\ö" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "\\ï" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "\\ϕ" input (is (= input (parcera/code (parcera/clojure input))))))
  
  (testing "names"
    (as-> "foo" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "foo-bar" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "foo->bar" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "->" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "->as" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "föl" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "Öl" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "ϕ" input (is (= input (parcera/code (parcera/clojure input)))))
    (as-> "❤️" input (is (= input (parcera/code (parcera/clojure input)))))))

All but the ❤️ test pass. And that test errors like so:

ERROR in: parcera.test.core: core.cljc, line 52: simple: names:
  error: java.lang.IllegalArgumentException: No matching clause: [:index 0] + "
  expected: (= input (parcera/code (parcera/clojure input)))

Incorrectly parses hexadecimal numbers

(parcera/ast "0x1f") results in (:code (:parcera.core/failure (:symbol "0x1f"))) with message: symbol name cannot start with a number.

deftype, defrecord, and constructor calls?

It looks like there is currently no specific support for deftype, defrecord, or constructor calls as described here: https://clojure.org/reference/reader#_deftype_defrecord_and_constructor_calls_version_1_3_and_later

I get this from ast:

parcera.core=> (ast "#my.klass_or_type_or_record[:a :b :c]")
(:code (:tag (:symbol "my.klass_or_type_or_record") (:vector (:keyword ":a") (:whitespace " ") (:keyword ":b") (:whitespace " ") (:keyword ":c"))))

and:

parcera.core=> (ast "#my.record{:a 1, :b 2}")
(:code (:tag (:symbol "my.record") (:map (:keyword ":a") (:whitespace " ") (:number "1") (:whitespace ", ") (:keyword ":b") (:whitespace " ") (:number "2"))))

So IIUC it looks like these are currently recognized as tagged literals (failure? returned nil for both FWIW).

Here is a brief transcript of a repl session for the record case:

user=> (defrecord Fun [a b])
user.Fun
user=> #user.Fun[1 2]
#user.Fun{:a 1, :b 2}
user=> #user.Fun{:a 1 :b 2}
#user.Fun{:a 1, :b 2}

I think some relevant lines in LispReader may be:

https://github.com/clojure/clojure/blob/0035cd8d73517e7475cb8b96c7911eb0c43a1a9d/src/jvm/clojure/lang/LispReader.java#L1451-L1502

Fyi, made a babashka pod wrapping parcera

A babashka pod is a command line program which can interact with babashka, a scripting tool for Clojure. The pod has to communicate via data: EDN or JSON. Since parcera returns pure EDN (I think?) it works well with the pod approach.

See:
https://github.com/babashka/pod-babashka-parcera

Incorrectly parses unquote inside var quote

From (parcera/ast "#'~a") I expected (:code (:var_quote (:unquote (:symbol "a")))) but actually got (:code (:var_quote (:symbol (:parcera.core/failure "~") "a")))

# at start of keyword supported by clojure(script) reader

Both Clojure & ClojureScript support this. Found this example in Quil:

https://github.com/quil/quil/blob/2753b568bd9f43a622ccc9972c6a189563d5041b/test/cljs/quil/snippet.cljs#L31

(:code (:simple_keyword (:parcera.core/failure "#") "results"))

Fails on chained discard

Discards can be "chained" in order to discard the following form also. Parcera is not correctly ignoring following forms.

To give an example:

{#_#_:a :b} => {}

But parcera creates this AST; note how the :b is not discarded as it would be by the Clojure Reader.

(:code
 (:parcera.core/failure
  (:map
   (:discard (:discard (:simple_keyword "a")))
   (:whitespace " ")
   (:simple_keyword "b"))))
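
The chaining can be modelled on a flat token sequence: a discard reads the next form (which may itself start with a discard), throws it away, and then continues reading. The sketch below is only a toy model of that behaviour, with :discard standing in for #_; read-form and read-all are hypothetical names and not parcera functions.

```clojure
(defn read-form
  "Returns [form remaining] for the first kept form in tokens, or nil when
  the tokens run out."
  [tokens]
  (when-let [[token & more] (seq tokens)]
    (if (= :discard token)
      ;; read the next form, drop it, then keep reading after it
      (when-let [[_ remaining] (read-form more)]
        (read-form remaining))
      [token more])))

(defn read-all
  "Reads every kept form from tokens into a vector."
  [tokens]
  (loop [tokens tokens, forms []]
    (if-let [[form remaining] (read-form tokens)]
      (recur remaining (conj forms form))
      forms)))

(read-all [:discard :discard :a :b]) ;; => [], like {#_#_:a :b} => {}
(read-all [:discard :a :b])          ;; => [:b]
```

Because the inner discard consumes :a and then returns :b as "the next form", the outer discard ends up dropping :b, which is why both keywords disappear.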

symbol name pattern doesnt match unicode characters

follow up on #3 (comment)

Metadata followed by Anonymous function form

This is not parsed correctly: (parcera/ast "^Callable #(println)") produces:

(:code
 (:metadata
  (:metadata_entry (:symbol "Callable"))
  (:whitespace " ")
  (:tag (:symbol (:parcera.core/failure "(") "println") (:parcera.core/failure ")"))))

Found here in the wild: https://github.com/nrepl/nREPL/blob/15ce1e96da05cabc4cd0e40f413f5cf3a3b47d02/src/clojure/nrepl/middleware/session.clj#L67
