openehr / adl-antlr Goto Github PK

View Code? Open in Web Editor NEW

1.0 1.0 4.0 199 KB

Antrl4 grammars for ADL

License: Apache License 2.0

ANTLR 98.24% Emacs Lisp 1.76%

adl-antlr's People

Contributors

Stargazers

Watchers

Forkers

serefarikan nedap pardashealth clarotech

adl-antlr's Issues

attributes without cardinality or existence?

In the archetype https://github.com/openEHR/adl-archetypes/blob/master/ADL2-reference/features/aom_structures/basic/openEHR-TEST_PKG-WHOLE.primitive_types.v1.adls the following snippet of ADL exists:

WHOLE[id1] matches {    -- test entry
        string_attr1 matches {"something"}
        string_attr2 matches {/this|that|something else/}
        string_attr3 matches {/cardio.*/}
        string_attr4 
}

This does not parse with the new grammar due to the following definition of c_attribute:

c_attribute: adl_dir? rm_attribute_id ( c_existence | c_cardinality | c_existence c_cardinality )
    | adl_dir? rm_attribute_id c_existence? c_cardinality? SYM_MATCHES '{' c_objects '}'
    ;

Not sure if this is a bug in the grammar, or an incorrect test file.

lexer errors on regular expressions

There is no regular expression lexer rule present. So if you try to parse for example

            allow_archetype OBSERVATION[id2] occurrences matches {0..1} matches {   -- Vital signs
                include
                    archetype_id/value matches {/openEHR-EHR-OBSERVATION\.exam(-a-zA-Z0-9_]+)*\.v1/}
                exclude
                    archetype_id/value matches {/openEHR-EHR-OBSERVATION\.blood_pressure(-a-zA-Z0-9_]+)*\.v1/}
            }

You get lexer errors:

trying to parse adl2-tests/validity/slots/openEHR-EHR-SECTION.VDSEV_slot_include_not_any_exclude_not_any.v1.adls
line 30:74 token recognition error at: '_'
line 32:84 token recognition error at: '_'

/path/to/dv_boolean = true not possible

Due to the strict separation of boolean and arithmetic logic, /path/to/dv_boolean = true is not possible. This limits score calculations like:

probably_delirium:
        /data[id2]/events[id3]/data[id4]/items[id38]/value/magnitude >= 3 implies /data[id2]/events[id3]/data[id4]/items[id41]/value/value = true
no_delirium:
        /data[id2]/events[id3]/data[id4]/items[id38]/value/magnitude < 3 implies /data[id2]/events[id3]/data[id4]/items[id41]/value/value = false

Also no support for at-codes for dv_coded_text, no string support in '==' and '!=' operators, as well as probably quite a lot of other types.

I'm currently creating a version that does this. It will lose some type-safety checks in the grammar, but you have to do those in code anyway for checking the types of path lookup. It is very likely that we need an assertion-separator sign, or a whitespace sensitive grammar.

Small possible improvements to make parsing easier

I made some small adaptations to your parser to make parsing just that little bit easier. This does not improve the functionality of the parser, only the usability.

See nedap/archie@0109786

you changed some Lexer tokens to strings in the metadata section. This means i now have to match the actual string in java-code to match this. Changed it to named parser rules so i can use those
added a SEMICOLON lexer token, So that can be used in the parsing of terminology code constraints with assumed values. Could also be solved (perhaps better?) with a different parser rule
c_string now directly has a STRING instead of a string_value. Same content, but not consistent with the rest of the grammar and therefore can require some extra code.

Rule grammar incomplete

Currently I'm implementing rules parsing. So far i've implemented what is in the grammar. Then I noticed the rules grammar is incomplete. At least missing is:

variable declarations
functions
queries

Also I cannot find an existing testset for the rules, and no open source rule parsing library. They look very useful for several things (score calculation, conditional questions, custom validations, etc...) and we certainly want to support them, including variables and queries. But the lack of open source libraries supporting this, and the lack of use in the CKM makes me wonder - is this actually being used?

adl rules have no operator precedence

The operators in the rules section have no operator precedence. It's very useful to include this in the antlr grammar. Something along the lines of:

expr:  mult ('+' mult)* ;
mult:  atom ('*' atom)* ;
atom:  INT | '(' expr ')' ;

You can then simply walk the tree to evaluate expressions.

Useful both for arithmetic and boolean expressions.

Time constraint parsing gives errors

    time_attr1 matches {hh:mm:ss}
        time_attr2 matches {hh:mm:??}
        time_attr3 matches {hh:mm:XX}
        time_attr4 matches {hh:??:??}
        time_attr5 matches {hh:??:XX}

gives errors:

[Test worker] INFO com.nedap.archie.adlparser.LargeSetOfADLsTest - trying to parse adl2-tests/features/aom_structures/basic/openEHR-TEST_PKG-WHOLE.assumed_values.v1.adls
[Test worker] WARN com.nedap.archie.adlparser.ADLErrorListener - FULL CONTEXT: 135-135, alts: {2, 5}
[Test worker] WARN com.nedap.archie.adlparser.ADLErrorListener - FULL AMBIGUITY: 135-135, exact: false
line 59:22 no viable alternative at input 'hh:mm:ss;'
line 60:22 no viable alternative at input 'hh:mm:XX;'
line 61:22 no viable alternative at input 'hh:??:XX;'
line 62:22 no viable alternative at input 'hh:??:??;'

I haven't checked the token stream, but the ambiguity warning at token 135 could be a cause.

Recommendations

If you name the grammars following the CamelCase convention, then the generated Java files will follow this too. Like this:
grammar OdinValues;
import BasePatterns;
If you name the directory in which the grammars are below "src/main/antlr4/.... (package)..." Maven finds them by default.
If you have under that directory a directory-structure, then Maven will add that directory structure in the package of the generated Java-files.
Only the root-grammars should be in that directory, all other referred grammars should be in "src/main/antlr4/imports"

By the way, the Maven plugin is configured default as this:
http://www.antlr.org/api/maven-plugin/latest/usage.html

(the version in this example should be 4.5)

Generating Javafiles will be then: mvn antlr4:antlr4
The Javafiles will be then in target/generated-sources/antlr4/....(package)...

You find all worked out in my fork:
https://github.com/BertVerhees/adl-antlr

If you do it like this, handling the grammars will be very easy for Java-programmers, all the packages are allright, everything is found and works, and have nice ClassNames.

Bert

Sibling order parsing hard to do with grammar changes

You changed:

c_objects: c_non_primitive_object_ordered+ | c_primitive_object ;
c_non_primitive_object_ordered: sibling_order? c_non_primitive_object ;

into

c_objects: ( sibling_order? c_non_primitive_object )+ | c_primitive_object ;

Which looks much better in the grammar. However, when parsing, it's a problem, because ANTLRgenerates the following methods in c_object

sibling_order() - returns a list of sibling orders
c_non_primitive_object() - returns a list of non primitive objets

Without any hint about which sibling order corresponds to which non primitive object. Which means you lose the ability to check which sibling order belongs to which c_non_primitive_object unless you perform some nasty tricks, which means this is now rather hard to use.

Lexer does not recognize URI with empty path

The following URI in ODIN is not recognized as an URI by the lexer:

http://www.test.example
http://www.test.example/

They are however both valid URIs.

The lexer does not recognize this because of the following lexer rules:

URI : URI_SCHEME SYM_COLON URI_HIER_PART ( '?' URI_QUERY )? ;
fragment URI_HIER_PART : ( '//' URI_AUTHORITY )? URI_PATH ;
fragment URI_PATH   : ( '/' URI_XPALPHA+ )+ ;

On first glance it looks like this can be fixed with a simple URI_PATH?. However, this clashes with the labels of the expression grammar.
So I tried:

fragment URI_HIER_PART : ( '//' URI_AUTHORITY ) | URI_PATH | ( '//' URI_AUTHORITY ) URI_PATH ;

Which is better, but it still clashes with the following rule statement:

label:/path/to/value + /other_path = 3

because it matches label:/path/to/value as an URI.

So the remaining fixes are:

Require the URI_AUTHORITY: fragment URI_HIER_PART : ( '//' URI_AUTHORITY ) URI_PATH? ;
Match the <>-characters that must always surround a URL in the lexer
Find a way to implement different lexer modes for different parts of the archetype
would be best I think. however, there is no easy way in the current ADL language design to implement lexer mode switching without resorting to rather complicated target language constructions. So I stuck with the first solution for now for archie, which is at least better than the alternatives. A better fix would be good though!

sibling order doesn't parse

If you write a sibling order before the first occuring attribute, it gets parsed.

If you put one between two attributes, it results in a parse error, or so it seems:

trying to parse adl2-tests/validity/specialisation/openEHR-EHR-OBSERVATION.spec_test_obs-VSSM_added_nodes_ordered.v1.adls
line 41:3 extraneous input 'before' expecting {'}', SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, ALPHA_UC_ID}
line 41:11 mismatched input 'id8' expecting ALPHA_LC_ID
line 42:3 mismatched input 'ELEMENT' expecting SYM_MATCHES
line 42:11 mismatched input 'id0.3' expecting ALPHA_LC_ID
line 43:4 mismatched input 'value' expecting '['
line 47:2 mismatched input '}' expecting SYM_TERMINOLOGY

definition
    OBSERVATION[id1.1] matches {    -- specialisation containing ordered extension nodes
        /data/events[id3]/data/items matches {
            after [id1000]
            ELEMENT[id0.1] matches {    -- Text field 2
                value matches {
                    DV_TEXT[id0.1]  -- Text field 2
                }
            }
            ELEMENT[id0.2] matches {    -- Quantity 2
                value matches {
                    DV_QUANTITY[id0.2]  -- Quantity 2
                }
            }
            before [id8]
            ELEMENT[id0.3] matches {    -- Text field 3
                value matches {
                    DV_TEXT[id0.3]  -- Text field 3
                }
            }
        }
    }

problem with matching /path/elements matches {"value"}

the parser reports an ambiguity with the following DFA:

s0-'/'->s1
s1-ALPHA_LC_ID->s2
s2-'/'->s4
s2-SYM_MATCHES->:s3=>1
s4-ALPHA_LC_ID->:s5^=>1

alternatives 1 and 2 are ambiguous.

Then instead of outputting one attribute_def, it outputs two:

/path

and

/element matches {"value"}

Ambiguous comment token definition in ODIN grammar

The base patterns file here: https://github.com/openEHR/adl-antlr/blob/master/src/main/antlr/base_patterns.g4 which is used in odin grammar defines comment tokens as follows:

H_CMT_LINE : '--------' '-'? '\n' ; // special type of comment for splitting template overlays
CMT_LINE : '--' .? '\n' -> skip ; // (increment line count)

This introduces a problem with antlr 4.7, but interestingly, 4.5 manages to parse input, which I think should not be possible: probably Antlr's smart parsing algorithms are relaxing the rules a bit too much.

Since Antlr considers the declaration of tokens in the grammar files during parsing, the H_CMT_LINE token overrides the CMT_LINE token in the following examples, taken from openEHR RM bmm (CIMI bmms have same types of comments too):

------------------------------------------------------
-- BMM version on which these schemas are based.
------------------------------------------------------
bmm_version = <"2.1">

------------------------------------------------------
-- schema identification
-- (schema_id computed as <rm_publisher>_<schema_name>_<rm_release>)
------------------------------------------------------

The long comments above are meant to be comments and they don't imply the special template overlay comment. However, based on the grammar file above, the lexer is bound to match these long comment lines to H_CMT_LINE which I'd say is the correct behaviour. The result, using antlr 4.7 (the latest version) is that the parse tree never goes beyond the long comment lines.

Changing the order of CMT_LINE and H_CMT_LINE fixes the issue but I'm sure it'd break the behaviour within archetype files where template overlays are expressed.

I'd suggest changing the template overlay comment token to something else. Maybe something like:

H_CMT_LINE : '-/' ''*? '\n'  ;  // special type of comment for splitting template overlays

Regarding ARCHETYPE_HRID

It is better to have items which are needed in AOM parsing in the Parser-section instead of the Lexer. This makes the AOM more safe to write and easier to maintain. The difference between the Parser and the Lexer I am referring to is that the Parser creates classes to define the grammar, these classes are easier to query.

Take a look at chapter 5.6 in the Antlr book, there is a list on page 81/82, and I refer to the last item. On page 83 is an example which explains (kind of) this issue also.

So instead of
ARCHETYPE_HRID : ARCHETYPE_HRID_ROOT '.v' VERSION_ID ;
ARCHETYPE_REF : ARCHETYPE_HRID_ROOT '.v' INTEGER ( '.' DIGIT+ )* ;
fragment ARCHETYPE_HRID_ROOT : (NAMESPACE '::')? IDENTIFIER '-' IDENTIFIER '-' IDENTIFIER '.' LABEL ; ..................

It should become something like:
archetype_hrid: archetype_hrid_root '.v' VERSION_ID ;
archetype_hrid_root: (namespace '::')? IDENTIFIER '-' IDENTIFIER '-' IDENTIFIER '.' LABEL ;
namespace : LABEL ('.' LABEL)+ ;

(needs to be worked out further)

Ambiguity in lexer: ARCHETYPE_REF and ARCHETYPE_HRID

This leads to parser errors, for example the specializes section needs an archetype reference, not a HRID, but the token stream supplies a HRID, leading to mismatched input errors:

line 5:4 mismatched input 'openEHR-EHR-COMPOSITION.report_result.v1.0.0' expecting ARCHETYPE_REF

With the following ADL-fragment:

template (adl_version=2.0.5; rm_release=1.0.2; generated)
    openEHR-EHR-COMPOSITION.blood_pressure.v1.0.0

specialize
    openEHR-EHR-COMPOSITION.report_result.v1.0.0

Possible problem in c_terminology_code rule

You write in JIRA: "You can only have an assumed value in the ac_code"

Your solution does work, I did some testing to simplify it, but I found it is the best possible solution, because of the rule that only an AC_CODE can have an AT_CODE as assumedValue.

Well done. Case closed ;-)

.NET antlr parser errors

I'm using .NET Core 6 with antlr-4.10.1-complete.jar as a parser/lexer generator and with the Antlr4.Runtime.Standard package for runtime.
Got some syntax errors trying to parse the openEHR-EHR-SECTION.problem_list.v0.adl archetype (taken from CKM):

line 1:23 mismatched input '1.4' expecting VERSION_ID
line 5:2 mismatched input 'at0000' expecting AT_CODE

Here is a project to reproduce - https://github.com/da-baranov/openehr-sandbox/tree/main/openehr-antlr-test

public static void Main(string[] args)
{
            var fileContent = File.ReadAllText(@"openEHR-EHR-SECTION.problem_list.v0.adl", Encoding.UTF8);
            var inputStream = new AntlrInputStream(fileContent);
            var lexer = new Lexer(inputStream);
            var tokenStream = new CommonTokenStream(lexer);
            var parser = new Parser(tokenStream);

            var errorListener = new TestErrorListener();
            var listener = new TestListener();
            parser.AddParseListener(listener);
            parser.AddErrorListener(errorListener);
            parser.adl14_archetype();
}

UID does not parse dotted syntax

In the spec it says in the metadata section

| SYM_UID '=' ( DOTTED_NUMERIC | GUID ) // TODO: this won't match all Oids properly

In the grammar it says GUID only, so the dotted_numeric syntax does not, parse, for example
openehr-TEST_PKG-WHOLE.child_with_oid.v1.0.0

archetype (adl_version=2.0.5; rm_release=1.0.2; uid=2.4.34.666.7.2)
    openehr-TEST_PKG-WHOLE.child_with_oid.v1.0.0

value matches {True} does not seem to parse

trying to parse adl2-tests/features/terminology/value_sets/openehr-ehr-EVALUATION.term_constraint_variations.v0.0.1.adls
line 34:27 mismatched input '}' expecting '['
line 186:22 mismatched input 'nomedct.info' expecting '>'
line 186:37 mismatched input '/' expecting '='

when trying to parse

ELEMENT[id5] occurrences matches {0..1} matches {   -- Absolute Contraindication?
   value matches {
      DV_BOOLEAN[id52] matches {
         value matches {True}
      }
   }
}

assumed value tree-walking is annoying in the current code

The current syntax for c_integer is:

c_integer: ( integer_value | integer_list_value | integer_interval_value | integer_interval_list_value ) ( ';' integer_value )? ;

This makes it rather annoying to parse in code, because you get just an array of integer values and you have to check the semicolon to see how to tree it.

it could be changed in:

c_integer: ( integer_value | integer_list_value | integer_interval_value | integer_interval_list_value ) ( ';' assumed_integer_value )? ;
assumed_integer_value: integer_value;

Which would make walking this tree much easier.

ARCHETYPE_HRID with '-' in concept id cannot be parsed

Many existing archetype-ids contain '-' marks in the concept id. Those cannot currently be parsed with this grammar, the lexer does not parse an ARCHETYPE_HRID in that case

HLINE gets recognised as a CMT_LINE

Token stream logs:

[@2410,39246:39257='"Comentario"',<96>,596:24]
[@2411,39258:39258='>',<19>,596:36]
[@2412,39276:39286='description',<91>,597:16]
[@2413,39288:39288='=',<4>,597:28]
[@2414,39290:39290='<',<17>,597:30]
[@2415,39291:39341='"Comentario sobre la medida de la presión arterial"',<96>,597:31]
[@2416,39342:39342='>',<19>,597:82]
[@2417,39356:39356='>',<19>,598:12]
[@2418,39366:39366='>',<19>,599:8]
[@2419,39372:39372='>',<19>,600:4]
[@2420,39458:39473='template_overlay',<35>,604:0]
[@2421,39475:39475='(',<1>,604:17]
[@2422,39476:39486='adl_version',<46>,604:18]

While the input is:

      >
        ["sl"] = <

        >
        ["es-ar"] = <

        >
    >


---------------------------------------------------------------------------------
template_overlay (adl_version=2.0.5; rm_release=1.0.2; generated)

interval and list parsing needs quite a lot of code

Because the list values have been fully specified like, they are a bit inconvenient to parse:

primitive_list_value : 
      string_list_value 
    | integer_list_value 
    | real_list_value 
    | boolean_list_value 
    | character_list_value 
    | term_code_list_value
    | date_list_value
    | time_list_value 
    | date_time_list_value 
    | duration_list_value 
    ;

primitive_interval_value :
      integer_interval_value
    | real_interval_value
    | date_interval_value
    | time_interval_value
    | date_time_interval_value
    | duration_interval_value
    ;

A similar pattern applies to cadl. But it leads to the following copy/paste code - and the following file is just for real and integer numbers, the same applies to all other types:

https://github.com/nedap/archie/blob/bfa1e4cfcfeec27100384af323c5fbbb365e2c50/parser/src/main/java/com/nedap/archie/adlparser/NumberConstraintParser.java

So it is quite a lot of repetitive copy/paste code to handle these constructs

You could do instead:

primitive_list_value :  
    primitive_value ( ( ',' primitive_value )+ | ',' SYM_LIST_CONTINUE ) ;
primitive_interval_value: 
    '|' SYM_GT? primitive_value SYM_INTERVAL_SEP SYM_LT? primitive_value 
    '|' relop primitive_value

This allows for reuse of code both in java and the grammar.

it has the drawback of having a bit less validation in the ANTLR-grammar, and needing a bit more validation in the tree walker/listener. But this validation is rather simple:

both values in an interval have to be the same type
all values in a list have to be the same type
only some types can appear in an interval

Things to do for Bert Verhees

Since @BertVerhees asked for things to do, a list of suggestions of things he can do:

In the adl-anltr grammar (this project):
1. Fix the regular expression recognition in the ANTLR Lexer. The difficulty is that paths resemble regular expressions.

In archie (https://github.com/nedap/archie) :

Implement tests for any language construct - you'll test this grammar as well as the parser. the framework is already in place
Implement parsing of CTemporal types. The grammar should parse those already, but the treewalker does not.

NAMESPACE token should be a fragment

The NAMESPACE token is ambiguous with ARCHETYPE_HRID, so sometimes a HRID matches as a NAMESPACE - result, parse error.

Making namespace a fragment (it can only be used in one place, so that sounds right) solves this issue

Regexp parsing breaks duration constraints

        duration_attr20 matches {Pw/PT0S}
        duration_attr21 matches {Pmw/PT0S}
        duration_attr22 matches {PWD/PT0S}

Matches

/PT0S}
        duration_attr21 matches {Pmw/

as a regexp and breaks parsing with a MisMatchedInputException later on.

Possible solution: the following lexer rule:

CONTAINED_REGEXP: '{'WS* (SLASH_REGEXP | CARET_REGEXP) WS* (';' WS* STRING)? WS* '}';
fragment SLASH_REGEXP: '/' SLASH_REGEXP_CHAR+ '/';
fragment SLASH_REGEXP_CHAR: ~[/\n\r] | ESCAPE_SEQ | '\\/';

fragment CARET_REGEXP: '^' CARET_REGEXP_CHAR+ '^';
fragment CARET_REGEXP_CHAR: ~[^\n\r] | ESCAPE_SEQ | '\\^';

problem in definition adl_relative_path.

I give following command, in my specialized test (excuse the camelcase)
grun Test archetypeSlot -gui

Then I give this as argument:
allow_archetype OBSERVATION[id2] occurrences matches {0..1} matches { -- Vital signs
include
domain_concept/test matches {/blood_pressure.v2/}
domain_concept matches {/.*/}
}

It gives an error:
line 4:43 no viable alternative at input 'domain_conceptmatches'

If I change the input like this:
allow_archetype OBSERVATION[id2] occurrences matches {0..1} matches { -- Vital signs
include
domain_concept/test matches {/blood_pressure.v2/}
domain_concept/test matches {/.*/}
}

I get no error.

It is because of the definition of adl_relative_path which requires a slash somewhere in the string, else it cannot make a valid argument of it. This means that a relative path, consisting of one segment can never be valid.

This will not be a big problem in real life, it will not often happen that a relativepath without slash is used, but it is a weakness of the grammar.

ODIN URL values aren't parsed correctly

URLs in odin_text seem to not get parsed correctly with this grammar:

trying to parse adl2-tests/features/aom_structures/rules/openEHR-EHR-ADMIN_ENTRY.dependency_rule.v1.adls
line 238:23 mismatched input 'penehr.org' expecting '>'
line 238:36 mismatched input '/' expecting '='
line 239:36 mismatched input '/' expecting '='
line 240:36 mismatched input '/' expecting '='
line 241:36 mismatched input '/' expecting '='

Content

term_bindings = <
   ["openehr"] = <
      ["at37"] = <http://openehr.org/id/124>
      ["at38"] = <http://openehr.org/id/425>
      ["at39"] = <http://openehr.org/id/427>
      ["at40"] = <http://openehr.org/id/429>
   >

Multiline ODIN values aren't parsed correctly

trying to parse adl2-tests/features/description/text/openEHR-TEST_PKG-ENTRY.long_lines.v1.adls
line 21:10 no viable alternative at input '<"'
line 21:16 mismatched input 'is' expecting '='
line 21:19 mismatched input 'what' expecting '=```
... many more lines here

when trying to parse

language = <[ISO_639-1::en]>
purpose = <"what this is for">
use = <"this is what it is used for, this could be a really long line or even
      multiple lines, just like 
      what you are reading now">

Add symbol lexer rules for more easy tree walking

See what i changed to the grammars in nedap/archie@538415e

It is rather inconvenient to parse the '<'-characters without having them as named tokens. It would be a good idea to name them.

This applies to the following symbols:

;
<

<=

There will probably be more situations when i parse more of the grammar.