Giter VIP home page Giter VIP logo

Comments (13)

wolandscat avatar wolandscat commented on September 27, 2024

There is a remaining error in rule design to do with regexes; this will be fixed soon.

from adl-antlr.

pieterbos avatar pieterbos commented on September 27, 2024

Did a quick test of your regular expression changes. I don't know if it was meant to fix most cases already, but testing is just a minute of work with the tests of my work-in-progress parser.
.
It does not seem to parse the following:

archetype_id/value matches {/.*/}

It does however parse:

archetype_id/value matches {/openEHR-EHR-CLUSTER\.specimen\.v1/}

The messages:

 trying to parse adl2-tests/features/specialisation/openEHR-EHR-OBSERVATION.ordering_parent
line 62:40 token recognition error at: '.*'
line 62:42 extraneous input '/' expecting {'(', ')', '=', '[', ']', '{', '}', ',', '*', ':', '-', '^', '!=', '<=', '<', '>=', '>', '+', '_', '\.', '\', '|', ROOT_ID_CODE, ID_CODE, AT_CODE, AC_CODE, DATE_CONSTRAINT_PATTERN, TIME_CONSTRAINT_PATTERN, DATE_TIME_CONSTRAINT_PATTERN, DURATION_CONSTRAINT_PATTERN, SYM_ARCHETYPE, SYM_TEMPLATE_OVERLAY, SYM_TEMPLATE, SYM_OPERATIONAL_TEMPLATE, SYM_SPECIALIZE, SYM_LANGUAGE, SYM_DESCRIPTION, SYM_DEFINITION, SYM_RULES, SYM_TERMINOLOGY, SYM_ANNOTATIONS, SYM_COMPONENT_TERMINOLOGIES, SYM_ADL_VERSION, SYM_RM_RELEASE, SYM_IS_CONTROLLED, SYM_IS_GENERATED, SYM_UID, SYM_BUILD_UID, SYM_EXISTENCE, SYM_OCCURRENCES, SYM_CARDINALITY, SYM_ORDERED, SYM_UNORDERED, SYM_UNIQUE, SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, SYM_INCLUDE, SYM_EXCLUDE, SYM_AFTER, SYM_BEFORE, SYM_CLOSED, SYM_THEN, SYM_AND, SYM_OR, SYM_XOR, SYM_NOT, SYM_IMPLIES, SYM_FOR_ALL, SYM_EXISTS, SYM_MATCHES, '...', '..', WS, '
', '--------------------*', CMT_LINE, ISO8601_DATE, ISO8601_TIME, ISO8601_DATE_TIME, ISO8601_DURATION, SYM_TRUE, SYM_FALSE, ARCHETYPE_HRID, ARCHETYPE_REF, ARCHETYPE_HRID_ROOT, VERSION_ID, NAMESPACE, GUID, ALPHA_UC_ID, ALPHA_LC_ID, TERM_CODE_REF, URI, INTEGER, REAL, STRING, CHARACTER, ';'}
line 67:24 missing '}' at 'value'
line 67:48 no viable alternative at input '-EHR'
line 67:93 mismatched input '_' expecting {'=', '/', '*', '-', '^', '!=', '<=', '<', '>=', '>', '+'}
line 68:10 extraneous input 'exclude' expecting {'}', SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, ALPHA_UC_ID}
line 69:40 token recognition error at: '.*'
line 69:42 extraneous input '/' expecting {'(', ')', '=', '[', ']', '{', '}', ',', '*', ':', '-', '^', '!=', '<=', '<', '>=', '>', '+', '_', '\.', '\', '|', ROOT_ID_CODE, ID_CODE, AT_CODE, AC_CODE, DATE_CONSTRAINT_PATTERN, TIME_CONSTRAINT_PATTERN, DATE_TIME_CONSTRAINT_PATTERN, DURATION_CONSTRAINT_PATTERN, SYM_ARCHETYPE, SYM_TEMPLATE_OVERLAY, SYM_TEMPLATE, SYM_OPERATIONAL_TEMPLATE, SYM_SPECIALIZE, SYM_LANGUAGE, SYM_DESCRIPTION, SYM_DEFINITION, SYM_RULES, SYM_TERMINOLOGY, SYM_ANNOTATIONS, SYM_COMPONENT_TERMINOLOGIES, SYM_ADL_VERSION, SYM_RM_RELEASE, SYM_IS_CONTROLLED, SYM_IS_GENERATED, SYM_UID, SYM_BUILD_UID, SYM_EXISTENCE, SYM_OCCURRENCES, SYM_CARDINALITY, SYM_ORDERED, SYM_UNORDERED, SYM_UNIQUE, SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, SYM_INCLUDE, SYM_EXCLUDE, SYM_AFTER, SYM_BEFORE, SYM_CLOSED, SYM_THEN, SYM_AND, SYM_OR, SYM_XOR, SYM_NOT, SYM_IMPLIES, SYM_FOR_ALL, SYM_EXISTS, SYM_MATCHES, '...', '..', WS, '
', '--------------------*', CMT_LINE, ISO8601_DATE, ISO8601_TIME, ISO8601_DATE_TIME, ISO8601_DURATION, SYM_TRUE, SYM_FALSE, ARCHETYPE_HRID, ARCHETYPE_REF, A

followed by more errors due to no multiline-string support yet.

Directly before logging these errors, ANTLR also logs that it attempts full context to my ANTLRErrorListener. Haven't checked what it means yet.

from adl-antlr.

pieterbos avatar pieterbos commented on September 27, 2024

I don't see any more regular expression parse errors for now. But I have not done a full test, for example i have not tested regular expressions with for non-western characters.

from adl-antlr.

wolandscat avatar wolandscat commented on September 27, 2024

This problem is not yet fixed; the current 'fix' is an unreliable band-aid - I am still researching the best solution for this one.

from adl-antlr.

pieterbos avatar pieterbos commented on September 27, 2024

I know of some people in our company who have experience with ANTLR. I'll try asking them for suggestions.

I do think this is a language construct that is hard to parse the way it is now. Also because the following ADL seems to be valid and has a use valid case, but is not a regular expression:

definition
    OBSERVATION[id1.1] matches {   /data[id2]/events[id7]/data[id4]/items matches { /data[id5]/value matches {

A rather ugly solution would be to create a preprocessor that matches all regular expressions with a line-based scanner, then replace all the '/'-characters at the beginning and end with '^'. Only then run the lexer.

from adl-antlr.

wolandscat avatar wolandscat commented on September 27, 2024

The current rules for regex are a nasty hack; they are in the cadl_primitives.g4 file, and are currently as follows:

// for REGEX: strip first and last char, and then process with PCRE grammar
c_string: ( string_value | string_list_value | regex_constraint ) assumed_string_value? ;
assumed_string_value: ';' string_value ;
regex_constraint: '/' regex1 '/' | '^' regex2 '^' ;
regex1: ( '_' | '.*' | '\\.' | '\\/' | ~'/' )+ ; // TODO: not clear why first 3 matches are needed, but they work.
regex2: ( '_' | '.*' | '\\.' | '\\/' | ~'^' )+ ;

The remaining problem is that if you introduce Lexer rules for the regex part, a regex expression can have almost anything in it, and the lexer will then start matching other strings of characters that turn up between '/' characters, i.e. in paths.

I originally posted to the Antlr google group on this, asking if there was a way to use the fact that the previous token was a '{' to conditionally turn the regex matching rule on; I didn't get any clear answer.

One idea is simply to say that regexes in ADL2 have to be between some character other than '/', e.g. the alternate '^' character included in the current grammar. This means I have to change the ADL Workbench to convert all the existing regexes in slots to ^^ form rather than // form; probably not hard, and maybe that's what we should do. But I have this nagging feeling something is wrong - matching regexes correctly was dead easy in my old yacc/lex grammars in Eiffel. Antlr is difficult for solving some simple problems!

from adl-antlr.

pieterbos avatar pieterbos commented on September 27, 2024

Ah, but i know a way to possibly do that. It's called Lexical Mode. See:

https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules#LexerRules-LexicalModes

You can use the three commands mode, popMode and pushMode like you use the skip command that you use on whitespace and comments, like so:

TokenName : «alternative» -> command-name
TokenName : «alternative» -> command-name («identifier or integer»)

Then you can specify lexer rules to only exist in a specific mode. So if you do something like:

LPARENT: '{' -> mode('regexp_allowed')

You can switch the lexer to a mode where it allows regular expressions, when it normally does not. Then two questions remain:

  1. When do you switch back to the default mode? regexpes can occur directly, or within a tuple, so after anything that is not a regular expression or the start of a tuple?
  2. In templates, you can type a path directly after a 'matches {', and it's a path, not a regular expression. How to solve the possible ambiguity there?

from adl-antlr.

wolandscat avatar wolandscat commented on September 27, 2024

Yep I know about lexical mode. I think it won't help here because there is no way to know when to enter the 'regex allowed' mode. The rule you propose will enter it just because there is a '{', but that symbol is ubiquitous in ADL, and pretty much every possible CADL element can come after it. So that implies that the 'regex allowed' mode is pretty much all of the CADL rules as they now are.

I couldn't see a way to define such a rule that would actually work, and also keep the grammar comprehensible. Happy to be proved wrong!

from adl-antlr.

 avatar commented on September 27, 2024

Maybe it is an idea to start the another lexical mode by {/ and end it by /}?

from adl-antlr.

wolandscat avatar wolandscat commented on September 27, 2024

I don't know if that can work or not - note that the '{' matching is done inline in the parser rules, which is how I think it should be. If we introduce a lexer pattern for '{/' how does it interact with that other matching?

from adl-antlr.

 avatar commented on September 27, 2024

On 28-10-15 15:00, Thomas Beale wrote:

I don't know if that can work or not - note that the '{' matching is
done inline in the parser rules, which is how I think it should be. If
we introduce a lexer pattern for '{/' how does it interact with that
other matching?


Reply to this email directly or view it on GitHub
#7 (comment).

I have it from the book The Definitive ANTLR4 Reference by Terence Parr.
Page 223 where the issue of XML is used as example.
In XML, text inside the tags must be parsed different from text between
the tags., it is in fact the same problem, and because in ADL (1.4)
regular expressions were always between {/ /}, it could work.

The only thing I have not tried is if it is possible to change the lexer
mode with two characters instead of one (as in the book example), but,
when they are tokenized, I think it could work

from adl-antlr.

 avatar commented on September 27, 2024

I wonder, has it been tried, changing the lexer mode when {/ or /} occurs? Or is it not a viable solution?

You can always say, that if it is not possible for a grammar to distinguish a mode for certainty, that something is wrong with that grammar.

There is a rule that if more then one lexer rule match the same input sequence, the priority goes to the first occuring rule. Another rule says that the lexer recognizes the most input characters

So if you have the token {/ defined above the token {

It is easy to check by creating a small test grammar.

grammar Test;
r
: TEXT
| LEFT_CURLY_PLUS_SLASH TEXT SLASH_PLUS_RIGHT_CURLY
| LEFT_CURLY TEXT RIGHT_CURLY
;
LEFT_CURLY_PLUS_SLASH: '{/' ;
LEFT_CURLY: '{' ;
SLASH_PLUS_RIGHT_CURLY: '/}' ;
RIGHT_CURLY: '}' ;
TEXT:[a-z]+ ;
WS: [ \r\t\n]+ -> skip;

You can see that it recognizes {/ as a different token from {
{/abc/}
[@0,0:1='{/',<1>,1:0]
[@1,2:4='abc',<5>,1:2]
[@2,5:6='/}',<3>,1:5]
[@3,8:7='',<-1>,2:0]

{abc}
[@0,0:0='{',<2>,1:0]
[@1,1:3='abc',<5>,1:1]
[@2,4:4='}',<4>,1:4]
[@3,6:5='',<-1>,2:0]

Here the rule does not exist, and it reports an error, but parses the rest.
{{/abc}
[@0,0:0='{',<2>,1:0]
[@1,1:2='{/',<1>,1:1]
[@2,3:5='abc',<5>,1:3]
[@3,6:6='}',<4>,1:6]
[@4,8:7='',<-1>,2:0]
line 1:1 extraneous input '{/' expecting TEXT

Here also
{/{abc/}
[@0,0:1='{/',<1>,1:0]
[@1,2:2='{',<2>,1:2]
[@2,3:5='abc',<5>,1:3]
[@3,6:7='/}',<3>,1:6]
[@4,9:8='',<-1>,2:0]
line 1:2 extraneous input '{' expecting TEXT

from adl-antlr.

pieterbos avatar pieterbos commented on September 27, 2024

I just fixed it with a rather simple pragmatich approach: match { / REGEXP ASSUMED_VALUE? / } as one token in the lexer, plus optional whitespace.
Now all you have to do is parse that token in your treewalker. I did the parsing with a very small separate ANTLR-grammar. Solution is in:

nedap/archie@aaab8ea

I implemented tests, and I could not break it, although I did find #20 (unrelated to regular expressions, but prevents me from writing more tests).

This is simple and small and it covers all cases. You could even do it in the parser or lexer if you allow java/target language code in your grammar.
Other possible solutions that will work:

  • Add a preprocessor which rewrites the regexpes starting and ending with '^', before the lexer
  • Manipulate the token stream between the lexer and the parser, and add some extra tokens in java
  • write custom lexer code in java(target language) in your grammar
  • redefine ADL so that regular expressions always start with a '^'

Why matching '{' will not work without custom java code in your lexer:

apart from that you have to match '{ WS* /' instead of '{/', the following valid ADL is the reason:

TYPE[id1] matches {/start/test matches {/this should work/}}

'/start/test' is a path, '/this should work/' is a regexp.

So you have to look ahead to ahead of the next / before you can determine it's a regular expression or a path. And you cannot do that with lexical modes.

from adl-antlr.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.