carocad / parcera
Grammar-based Clojure(script) parser
License: GNU Lesser General Public License v3.0
This is perfectly valid code, but parcera cannot handle it:
(parcera/ast "#a #b 1")
;; =>
(:code
(:tag (:symbol "a") (:whitespace " ") (:parcera.core/failure "#") (:symbol "b"))
(:whitespace " ")
(:number "1"))
AFAICT, octal characters (not numbers -- e.g. \o013) do not appear to be used very widely (at least not based on my sampling of code on Clojars).
Here is a real-world example:
IIUC, one can have one, two, or three octal "digits" after the \o, but when there are three digits, the left-most digit should be less than or equal to 3. So for example, these work:
\o1
\o77
\o377
But this doesn't:
\o400
If it seems worth adding support for this, for parsing of the three-digit case, perhaps it is not worth trying to make the left-most digit's value constrained.
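The range rule described above can be sketched as a small predicate. This is only an illustration: valid-octal-char? is a hypothetical helper name, not part of parcera or Clojure, and it checks the numeric value of the digits rather than constraining the left-most digit directly.

```clojure
(defn valid-octal-char?
  "True when s is an octal character literal such as \\o377:
  one to three octal digits whose value is at most octal 377.
  Hypothetical helper, used only for this illustration."
  [s]
  (boolean
    (when-let [[_ digits] (re-matches #"\\o([0-7]{1,3})" s)]
      ;; 0377 is Clojure's octal literal for 255
      (<= (Integer/parseInt digits 8) 0377))))

(valid-octal-char? "\\o1")   ;; => true
(valid-octal-char? "\\o77")  ;; => true
(valid-octal-char? "\\o377") ;; => true
(valid-octal-char? "\\o400") ;; => false
```

Checking the value after matching keeps the pattern itself simple, which is roughly the trade-off suggested above for the three-digit case.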
FWIW, I think some relevant code in clojure's source is:
Hopefully this would make the setup of the project a bit more simple (less tooling required)
Currently we check that parcera can do a roundtrip for any input string. The problem with this approach is that it doesn't take ambiguity into account.
For example:
~@hello
could be parsed as:
symbol
unquote + @hello (symbol)
unquote-splicing + symbol
Right now parcera doesn't check that, even though it should.
The other problem is that parcera only checks the "accepted" parsed AST. It doesn't check that no other interpretations are available, so it cannot guarantee that it will always return the same value.
It seems to me that the [:whitespace ""] nodes aren't necessary:
user=> (parcera/clojure "{:a 1}")
[:code [:whitespace ""] [:map [:map-content [:whitespace ""] [:simple-keyword "a"] [:whitespace " "] [:whitespace ""] [:number "1"] [:whitespace ""]]] [:whitespace ""]]
Why is the result wrapped in a [:code ...] vector? Can it be something other than code?
Currently I use a nested lazy sequence transformation to transform the result from Antlr4 into a Clojure data structure.
This provides good performance, since the transformation is lazy and the user only pays for what they consume. Unfortunately, this transformation incurs some overhead (I am not sure of the exact cause; maybe just the price of immutability, or the deeply nested stack). Last time I checked, antlr4 was able to parse clojure.core in around 30ms, but the transformation adds roughly another 150-200ms to realise all sequences.
I see several options:
(java.util.Collections/unmodifiableList (new java.util.ArrayList))
on a repl
The current list of things in the grammar that can have metadata on them is:
( symbol
| collection
| set
| namespaced_map
| tag
| fn
| unquote
| unquote_splicing
It appears that there are more possibilities. Additional examples for:
| conditional
| deref
| quote
| backtick
| var_quote
are respectively:
In this list, deref seems to be the most common, followed by quote, backtick, and conditional. I only encountered one example for var-quote.
The clojure.org metadata page currently says:
Note that metadata reader macros are applied at read-time, not at evaluation-time, and can only be used with values that support metadata, like symbols, vars, collections, sequences, namespaces, refs, atoms, agents, etc. Some important exceptions that don’t support metadata are strings, numbers, booleans, Java objects, keywords (these are cached and can be shared within the runtime), and deftypes (unless they explicitly implement clojure.lang.IMeta).
The "etc." part seems to be able to include a fair bit -- as long as it implements `clojure.lang.IMeta`, I guess, and IIUC this is not something that one can necessarily tell in all cases without running the code.
I haven't gone through the grammar exhaustively to check for other possibilities so I suppose there could be more...
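To make the IMeta point concrete, here is a minimal runtime check. supports-meta? is a hypothetical helper name used only for this illustration. It inspects a value that has already been evaluated, which is exactly the information a purely syntactic parser does not have.

```clojure
(defn supports-meta?
  "True when x implements clojure.lang.IMeta.
  Hypothetical helper, used only for this illustration."
  [x]
  (instance? clojure.lang.IMeta x))

(supports-meta? 'sym)     ;; => true  (symbol)
(supports-meta? [1 2])    ;; => true  (vector)
(supports-meta? #'map)    ;; => true  (var)
(supports-meta? (atom 0)) ;; => true  (atom)
(supports-meta? "text")   ;; => false (string)
(supports-meta? :kw)      ;; => false (keyword)
(supports-meta? 42)       ;; => false (number)
```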
I was assuming both clojure and clojurescript to follow the reader reference implementation: https://clojure.org/reference/reader
However, it seems that there are differences (both intentional and accidental) in the way that the reader behaves.
I think that in order to avoid blowing my mind (and compromising parcera's future development) I should just let the user control how much validation they want to enforce. This could simply be a flag on the parcera/ast function.
In clj I get:
user=> [1#_ 2]
[1]
In parcera I get:
parcera.core=> (ast "[1#_ 2]")
(:code (:vector (:symbol "1#_") (:whitespace " ") (:number "2")))
Possibly a bit simpler are:
parcera.core=> (ast "1#_ 2")
(:code (:symbol "1#_") (:whitespace " ") (:number "2"))
and:
parcera.core=> (ast "1#_2")
(:code (:symbol "1#_2"))
This is with: 1c51356
#?@() works, #?@ () fails.
#?@() => (:code (:conditional_splicing))
#?@ () => (:code (:tag (:symbol "?") (:parcera.core/failure "@") (:whitespace " ") (:list)))
Based on this presentation about Clojure performance: https://www.youtube.com/watch?v=3SSHjKT3ZmA&t=1345s
I think that replacing the Listener functions with records on the instaparse algorithm could yield a significant performance improvement.
References:
Based on this presentation about Clojure performance: https://www.youtube.com/watch?v=3SSHjKT3ZmA&t=1345s
I think that replacing the atoms in the Trampoline data structure with mutable data structures could yield a significant performance improvement.
References:
Currently parcera requires the StringBuilder class from Java in order to make the writing process fast (Clojure's str is too slow on large input).
One option mentioned by @borkdude is to use StringBuffer from the google library to support the same on Clojurescript.
Depending on the performance and added complexity I should consider this as a long term goal.
Open question:
(parcera/ast ":1")
=> clojure.lang.PersistentList cannot be cast to java.lang.CharSequence
This problem seems to be twofold:
:simple_keyword will always have a String for its child, so it blows up if the keyword turns out to be invalid (as the child isn't a string but a list with the failure info):
after that I guess it's fair game with the rest of symbol?) the Clojure reader does accept it, so does the ClojureScript reader.
e.g. {#?@()} is valid, but parcera fails with: (:code (:parcera.core/failure (:map (:conditional_splicing)))), message: :message "((:conditional_splicing)) - failed: (even? (count %)) spec: :parcera.spec/map\n"
It looks like if there is a comment right after some metadata the parse result may be off:
(pc/ast "^{:a true} ;; hello\n [:a]")
#_
(:code
(:metadata
(:metadata_entry
(:map
(:keyword ":a") (:whitespace " ")
(:symbol "true")))
(:whitespace " "))
(:comment ";; hello")
(:whitespace "\n ")
(:vector
(:keyword ":a")))
Note that the comment, last whitespace, and vector are siblings of metadata and not children.
This is in contrast to:
(pc/ast "^{:a true} [:a]")
#_
(:code
(:metadata
(:metadata_entry
(:map
(:keyword ":a") (:whitespace " ")
(:symbol "true")))
(:whitespace " ")
(:vector (:keyword ":a"))))
where vector is a child of metadata.
I encountered this in the wild: https://github.com/lspector/Clojush/blob/master/src/clojush/instructions/code.clj#L166-L176
Before Clojure added an official ##NaN, there were literals provided in tools.reader. The significance of this is that they were supported in ClojureScript. They seem to have stopped working with recent versions of the ClojureScript compiler, but there are still legacy codebases around in which NaN, -Inf, and Inf are not being read correctly.
At the least, it would seem that Inf and NaN should work as symbols.
Right now only a hiccup-like output format is supported. It might be worth exposing an enlive format as well.
See instaparse: https://github.com/Engelberg/instaparse#output-format
I received some feedback that it would be better if parcera would split the whitespace per line so that instead of having single whitespace catching all newlines, it would be split per line. I am still not sure exactly what the use case behind it is nor why it "should" be in parcera and not as a helper function in the consumer side. However, in order to remember it, I will document it here and think about it a bit more.
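For reference, a consumer-side helper along these lines could look like the sketch below. Both function names are hypothetical and nothing here is part of parcera. It only assumes that whitespace nodes have the shape (:whitespace "...") shown in the AST examples elsewhere on this page.

```clojure
(defn split-whitespace
  "Splits a whitespace string into chunks that each end at a newline;
  a trailing chunk without a newline is kept as-is.
  Hypothetical helper, used only for this illustration."
  [s]
  (re-seq #"[^\n]*\n|[^\n]+" s))

(defn split-whitespace-node
  "Turns one (:whitespace \"...\") node into one node per line."
  [[tag text]]
  (map #(list tag %) (split-whitespace text)))

(split-whitespace-node '(:whitespace "\n  \n "))
;; => ((:whitespace "\n") (:whitespace "  \n") (:whitespace " "))
```

Concatenating the chunks gives back the original string, so a roundtrip through such a helper would not lose any input.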
Currently conditional allows anything inside the list:
conditional: '#?' whitespace* list;
which for all practical purposes is pretty much everything in a Lisp like Clojure. It might be better to limit it to something like (keyword ignore* form)+. However, that means that '(' and ')' would be defined in both list and conditional, which is not nice.
Currently most macro parser rules look like macro: pattern whitespace? form. Although this works, it creates a "fake" hierarchy in the AST which is not immediately visible to a developer looking at the code instead of the AST.
This also creates weird situations when taking the discard macro into account (see #45).
The main problem that I see is that Clojure's LispReader is inherently stateful. Macros are used as flags to modify the behavior of the next form; however, understanding what the next form is becomes incredibly difficult for things like reader conditionals (see #47) and discard macros. There is also no guarantee that Clojure won't have more of those cases in the future.
One possible way to tackle this is to just mimic the way that the LispReader works, such that macros become flags on a sequence of forms like [:list [:discard] [:quote] [:list ....]].
Although that would certainly tackle the issues above, it would make validation of input much more difficult, since the information required to validate a form is no longer contained only inside the form itself.
Before ^ was special, there was #^. The reader still works perfectly with this. See this example in the Clojure tests: https://github.com/clojure/clojure/blob/28b87d53909774af28f9f9ba6dfa2d4b94194a57/test/clojure/test_clojure/annotations/java_8.clj#L9-L16
user=> (meta #^{:a 10} {})
{:a 10}
#! means comment. This is used to allow for executable Clojure scripts with a shebang.
See https://clojure.github.io/clojure/branch-master/clojure.datafy-api.html
It might be better than having custom protocols to map between antlr and Clojure, as we currently do.
There are some patterns that are too difficult to match correctly. For example:
hello/world/
=> [:symbol "hello/world"] [:symbol "/"]
::/
Most of these issues are due to Antlr4's lack of built-in lookahead functionality. However, antlr is what I have, so I need to make the best of it. I think it should be possible to make a parser rule to match those extra cases and pass them on to the user to handle.
I noticed in clj that:
user=> (type (read-string "0M"))
java.math.BigDecimal
user=> (type (read-string "00M"))
java.math.BigDecimal
user=> (type (read-string "000M"))
java.math.BigDecimal
user=> (type (read-string "0000M"))
java.math.BigDecimal
user=> 0000M
0M
With the master and fix/names branches I get:
user=> (require '[parcera.core :as pc])
nil
user=> (pc/ast "0M")
(:code (:number "0M"))
user=> (pc/ast "00M")
(:code (:parcera.core/failure "00M"))
user=> (pc/ast "000M")
(:code (:parcera.core/failure "000M"))
user=> (pc/ast "0000M")
(:code (:parcera.core/failure "0000M"))
So it looks like a problem when there are two or more leading zeros?
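As a sketch of a token pattern that tolerates leading zeros for M-suffixed literals: the regex below is an assumption modelled on the REPL behaviour shown above. It is not taken from parcera's grammar or from Clojure's source, and big-decimal-literal? is a hypothetical name.

```clojure
(def decimal-pattern
  ;; sign, one or more digits (leading zeros allowed), optional fraction,
  ;; optional exponent, then a mandatory M suffix
  #"[-+]?[0-9]+(?:\.[0-9]*)?(?:[eE][-+]?[0-9]+)?M")

(defn big-decimal-literal?
  "True when s looks like an M-suffixed decimal literal.
  Hypothetical helper, used only for this illustration."
  [s]
  (some? (re-matches decimal-pattern s)))

(big-decimal-literal? "0M")    ;; => true
(big-decimal-literal? "00M")   ;; => true
(big-decimal-literal? "0000M") ;; => true
(big-decimal-literal? "M")     ;; => false
```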
With master (83cd988) and fix-names (72abf7e) for a map in a tagged literal, I get:
parcera.core=> (ast "#mytag {}")
(:code
(:tag
(:symbol "mytag") (:whitespace " ")
(:map)))
That looks good.
However, for a set, I get:
parcera.core=> (ast "#mytag #{}")
(:code
(:tag
(:symbol "mytag") (:whitespace " "))
(:set))
I may be mistaken, but IIUC, one aspect of tagged literals is that one can arrange to have a function operate upon whatever was tagged to produce an arbitrary value.
On the "Reader" page at clojure.org, there is this text:
...parse the form following the tag.
via: https://clojure.org/reference/reader#tagged_literals
It's not clear to me if there are any restrictions as to what the form can be. I had assumed that pretty much anything can be made to work.
I understand that #= is not encouraged, but FWIW:
user=> #=#?(:clj (+ 1 2) :cljr (+ 1 1))
3
The current rule (line 146 in 1c51356):
eval: '#=' ignore* (symbol | list);
does not list conditional.
It seems that Antlr4 has some extra functionality when the lexer and the parser are separated. One recurring problem with the current grammar is that the end of a token can be the start of the next. I am currently facing this issue in #78, where \o387 is not a valid octal character but is interpreted as an octal character followed by a number (\o3 87), which is valid input.
Maybe if I could split these, it would be possible to better differentiate between tokens.
Radix numbers do not appear to take any optional 'N' (or 'M') suffix:
user=> 2r01N
Syntax error reading source at (REPL:11:0).
For input string: "1N"
user=> 2r01M
Syntax error reading source at (REPL:12:0).
For input string: "1M"
The current grammar appears to allow an optional 'N' suffix:
Lines 209 to 210 in 1c51356
I confirmed the following with 1c51356:
user=> (require '[parcera.core :as pc])
nil
user=> (pc/ast "2r01N")
(:code (:number "2r01N"))
One of the performance notes from the author of instaparse is to feed the parser smaller chunks of text.
Although in general I am against writing the grammar twice, I think that I could re-use the literals and some macro rules such that I would only do "wild matching" for collections.
I could then pass each "collection" match to the clojure parser, which should ideally be faster than the current solution.
references:
The current list of things that can be in metadata_entry are:
( map | symbol | string | keyword | macro_keyword )
Since reader conditionals can end up as one of these values, maybe it's worth adding that too?
Here is a real-world example: https://github.com/mikera/core.matrix/blob/develop/src/main/clojure/clojure/core/matrix/impl/persistent_vector.cljc#L64
I didn't find any examples for deprecated_metadata_entry, but maybe that makes sense if #^ was deprecated before reader conditionals (1.7).
(parcera/ast "\\u000a") produces (:code (:character "\\u") (:parcera.core/failure (:symbol "000a")))
but should produce (:code (:character "\\u000a"))
Strangely, 000a-c don't work, but 000d-f do!
Hello, I bumped parcera in an old project from 0.3.1 to 0.11.5, and nothing worked anymore.
It would be great if you could add a CHANGELOG so one could check if it’s safe to bump the version or if there are breaking changes to deal with.
Thanks!
v0.4.0 of parcera dropped support for ClojureScript.
I think that this would be a great addition to parcera however I lack the knowledge in Clojurescript setup to make this work.
I got a piece of it working on this branch: https://github.com/carocad/parcera/tree/antlr-js
However the Closure compiler was incorrectly transforming the javascript files from antlr so the import never really worked.
I tested the generated code from antlr directly on node.js and it worked (see index.js), but I didn't get it working on ClojureScript :/
Thanks for parcera.
I used it to make alc.x-as-tests. At the moment it helps one to use the content of comment blocks with appropriate content as tests:
(comment
(+ 1 1)
;; => 2
)
The program will produce "derived" source by unwrapping comment blocks and rewriting certain sets of expressions as tests. The resulting source can then be executed to run the tests and receive a report.
The Clojure reader accepts this code:
(defn ❤️ [x]
(str "I ❤️ " x))
(❤️ "Clojure")
But Parcera (0.1.2) croaks on it.
These tests expose it:
(deftest simple
(testing "character literals"
(as-> "\\t" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "\\n" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "\\r" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "\\a" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "\\é" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "\\ö" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "\\ï" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "\\ϕ" input (is (= input (parcera/code (parcera/clojure input))))))
(testing "names"
(as-> "foo" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "foo-bar" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "foo->bar" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "->" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "->as" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "föl" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "Öl" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "ϕ" input (is (= input (parcera/code (parcera/clojure input)))))
(as-> "❤️" input (is (= input (parcera/code (parcera/clojure input)))))))
All but the ❤️ test pass. And that test errors like so:
ERROR in: parcera.test.core: core.cljc, line 52: simple: names:
error: java.lang.IllegalArgumentException: No matching clause: [:index 0] + "
expected: (= input (parcera/code (parcera/clojure input)))
(parcera/ast "0x1f") results in (:code (:parcera.core/failure (:symbol "0x1f"))) with message: symbol name cannot start with a number.
It looks like there is currently no specific support for deftype, defrecord, or constructor calls as described here: https://clojure.org/reference/reader#_deftype_defrecord_and_constructor_calls_version_1_3_and_later
I get this from ast:
parcera.core=> (ast "#my.klass_or_type_or_record[:a :b :c]")
(:code (:tag (:symbol "my.klass_or_type_or_record") (:vector (:keyword ":a") (:whitespace " ") (:keyword ":b") (:whitespace " ") (:keyword ":c"))))
and:
parcera.core=> (ast "#my.record{:a 1, :b 2}")
(:code (:tag (:symbol "my.record") (:map (:keyword ":a") (:whitespace " ") (:number "1") (:whitespace ", ") (:keyword ":b") (:whitespace " ") (:number "2"))))
So IIUC it looks like these are currently recognized as tagged literals (failure? returned nil for both FWIW).
Here is a brief transcript of a repl session for the record case:
user=> (defrecord Fun [a b])
user.Fun
user=> #user.Fun[1 2]
#user.Fun{:a 1, :b 2}
user=> #user.Fun{:a 1 :b 2}
#user.Fun{:a 1, :b 2}
I think some relevant lines in LispReader may be:
A babashka pod is a command line program which can interact with babashka, a scripting tool for Clojure. The pod has to communicate via data: EDN or JSON. Since parcera returns pure EDN (I think?) it works well with the pod approach.
From (parcera/ast "#'~a") I expected (:code (:var_quote (:unquote (:symbol "a")))) but actually got (:code (:var_quote (:symbol (:parcera.core/failure "~") "a")))
Both Clojure & ClojureScript support this. Found this example in Quil:
(:code (:simple_keyword (:parcera.core/failure "#") "results"))
Discards can be "chained" in order to discard the following form also. Parcera is not correctly ignoring following forms.
To give an example:
{#_#_:a :b} => {}
But parcera creates this AST; note how the :b is not discarded as it would be by the Clojure reader.
(:code
(:parcera.core/failure
(:map
(:discard (:discard (:simple_keyword "a")))
(:whitespace " ")
(:simple_keyword "b"))))
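The chaining behaviour can be modelled on a flat sequence of forms. This is a sketch only: the :discard marker and drop-discarded are hypothetical names, and the model ignores nesting inside collections. Each marker adds one pending discard, and each subsequent form consumes one.

```clojure
(defn drop-discarded
  "Removes every :discard marker together with the form it discards.
  Consecutive markers stack, mirroring how #_#_ drops two forms.
  Hypothetical helper, used only for this illustration."
  [forms]
  (loop [pending 0, remaining forms, kept []]
    (if-let [[form & more] (seq remaining)]
      (cond
        (= :discard form) (recur (inc pending) more kept) ; one more form to drop
        (pos? pending)    (recur (dec pending) more kept) ; drop this form
        :else             (recur pending more (conj kept form)))
      kept)))

(drop-discarded [:discard :discard :a :b]) ;; => []
(drop-discarded [:discard :discard 1 2 3]) ;; => [3]

;; The Clojure reader itself agrees for the example above:
(read-string "{#_#_:a :b}") ;; => {}
```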
follow up on #3 (comment)
This is not parsed correctly: (parcera/ast "^Callable #(println)") produces:
(:code
(:metadata
(:metadata_entry (:symbol "Callable"))
(:whitespace " ")
(:tag (:symbol (:parcera.core/failure "(") "println") (:parcera.core/failure ")")))
)
Found here in the wild: https://github.com/nrepl/nREPL/blob/15ce1e96da05cabc4cd0e40f413f5cf3a3b47d02/src/clojure/nrepl/middleware/session.clj#L67