openehr / adl-antlr Goto Github PK
View Code? Open in Web Editor NEWAntrl4 grammars for ADL
License: Apache License 2.0
Antrl4 grammars for ADL
License: Apache License 2.0
In the archetype https://github.com/openEHR/adl-archetypes/blob/master/ADL2-reference/features/aom_structures/basic/openEHR-TEST_PKG-WHOLE.primitive_types.v1.adls the following snippet of ADL exists:
WHOLE[id1] matches { -- test entry
string_attr1 matches {"something"}
string_attr2 matches {/this|that|something else/}
string_attr3 matches {/cardio.*/}
string_attr4
}
This does not parse with the new grammar due to the following definition of c_attribute:
c_attribute: adl_dir? rm_attribute_id ( c_existence | c_cardinality | c_existence c_cardinality )
| adl_dir? rm_attribute_id c_existence? c_cardinality? SYM_MATCHES '{' c_objects '}'
;
Not sure if this is a bug in the grammar, or an incorrect test file.
There is no regular expression lexer rule present. So if you try to parse for example
allow_archetype OBSERVATION[id2] occurrences matches {0..1} matches { -- Vital signs
include
archetype_id/value matches {/openEHR-EHR-OBSERVATION\.exam(-a-zA-Z0-9_]+)*\.v1/}
exclude
archetype_id/value matches {/openEHR-EHR-OBSERVATION\.blood_pressure(-a-zA-Z0-9_]+)*\.v1/}
}
You get lexer errors:
trying to parse adl2-tests/validity/slots/openEHR-EHR-SECTION.VDSEV_slot_include_not_any_exclude_not_any.v1.adls
line 30:74 token recognition error at: '_'
line 32:84 token recognition error at: '_'
Due to the strict separation of boolean and arithmetic logic, /path/to/dv_boolean = true is not possible. This limits score calculations like:
probably_delirium:
/data[id2]/events[id3]/data[id4]/items[id38]/value/magnitude >= 3 implies /data[id2]/events[id3]/data[id4]/items[id41]/value/value = true
no_delirium:
/data[id2]/events[id3]/data[id4]/items[id38]/value/magnitude < 3 implies /data[id2]/events[id3]/data[id4]/items[id41]/value/value = false
Also no support for at-codes for dv_coded_text, no string support in '==' and '!=' operators, as well as probably quite a lot of other types.
I'm currently creating a version that does this. It will lose some type-safety checks in the grammar, but you have to do those in code anyway for checking the types of path lookup. It is very likely that we need an assertion-separator sign, or a whitespace sensitive grammar.
I made some small adaptations to your parser to make parsing just that little bit easier. This does not improve the functionality of the parser, only the usability.
Currently I'm implementing rules parsing. So far i've implemented what is in the grammar. Then I noticed the rules grammar is incomplete. At least missing is:
Also I cannot find an existing testset for the rules, and no open source rule parsing library. They look very useful for several things (score calculation, conditional questions, custom validations, etc...) and we certainly want to support them, including variables and queries. But the lack of open source libraries supporting this, and the lack of use in the CKM makes me wonder - is this actually being used?
The operators in the rules section have no operator precedence. It's very useful to include this in the antlr grammar. Something along the lines of:
expr: mult ('+' mult)* ;
mult: atom ('*' atom)* ;
atom: INT | '(' expr ')' ;
You can then simply walk the tree to evaluate expressions.
Useful both for arithmetic and boolean expressions.
time_attr1 matches {hh:mm:ss}
time_attr2 matches {hh:mm:??}
time_attr3 matches {hh:mm:XX}
time_attr4 matches {hh:??:??}
time_attr5 matches {hh:??:XX}
gives errors:
[Test worker] INFO com.nedap.archie.adlparser.LargeSetOfADLsTest - trying to parse adl2-tests/features/aom_structures/basic/openEHR-TEST_PKG-WHOLE.assumed_values.v1.adls
[Test worker] WARN com.nedap.archie.adlparser.ADLErrorListener - FULL CONTEXT: 135-135, alts: {2, 5}
[Test worker] WARN com.nedap.archie.adlparser.ADLErrorListener - FULL AMBIGUITY: 135-135, exact: false
line 59:22 no viable alternative at input 'hh:mm:ss;'
line 60:22 no viable alternative at input 'hh:mm:XX;'
line 61:22 no viable alternative at input 'hh:??:XX;'
line 62:22 no viable alternative at input 'hh:??:??;'
I haven't checked the token stream, but the ambiguity warning at token 135 could be a cause.
By the way, the Maven plugin is configured default as this:
http://www.antlr.org/api/maven-plugin/latest/usage.html
(the version in this example should be 4.5)
Generating Javafiles will be then: mvn antlr4:antlr4
The Javafiles will be then in target/generated-sources/antlr4/....(package)...
You find all worked out in my fork:
https://github.com/BertVerhees/adl-antlr
If you do it like this, handling the grammars will be very easy for Java-programmers, all the packages are allright, everything is found and works, and have nice ClassNames.
Bert
You changed:
c_objects: c_non_primitive_object_ordered+ | c_primitive_object ;
c_non_primitive_object_ordered: sibling_order? c_non_primitive_object ;
into
c_objects: ( sibling_order? c_non_primitive_object )+ | c_primitive_object ;
Which looks much better in the grammar. However, when parsing, it's a problem, because ANTLRgenerates the following methods in c_object
sibling_order() - returns a list of sibling orders
c_non_primitive_object() - returns a list of non primitive objets
Without any hint about which sibling order corresponds to which non primitive object. Which means you lose the ability to check which sibling order belongs to which c_non_primitive_object unless you perform some nasty tricks, which means this is now rather hard to use.
The following URI in ODIN is not recognized as an URI by the lexer:
http://www.test.example
http://www.test.example/
They are however both valid URIs.
The lexer does not recognize this because of the following lexer rules:
URI : URI_SCHEME SYM_COLON URI_HIER_PART ( '?' URI_QUERY )? ;
fragment URI_HIER_PART : ( '//' URI_AUTHORITY )? URI_PATH ;
fragment URI_PATH : ( '/' URI_XPALPHA+ )+ ;
On first glance it looks like this can be fixed with a simple URI_PATH?
. However, this clashes with the labels of the expression grammar.
So I tried:
fragment URI_HIER_PART : ( '//' URI_AUTHORITY ) | URI_PATH | ( '//' URI_AUTHORITY ) URI_PATH ;
Which is better, but it still clashes with the following rule statement:
label:/path/to/value + /other_path = 3
because it matches label:/path/to/value
as an URI.
So the remaining fixes are:
Require the URI_AUTHORITY: fragment URI_HIER_PART : ( '//' URI_AUTHORITY ) URI_PATH? ;
Match the <>
-characters that must always surround a URL in the lexer
Find a way to implement different lexer modes for different parts of the archetype
would be best I think. however, there is no easy way in the current ADL language design to implement lexer mode switching without resorting to rather complicated target language constructions. So I stuck with the first solution for now for archie, which is at least better than the alternatives. A better fix would be good though!
If you write a sibling order before the first occuring attribute, it gets parsed.
If you put one between two attributes, it results in a parse error, or so it seems:
trying to parse adl2-tests/validity/specialisation/openEHR-EHR-OBSERVATION.spec_test_obs-VSSM_added_nodes_ordered.v1.adls
line 41:3 extraneous input 'before' expecting {'}', SYM_USE_NODE, SYM_USE_ARCHETYPE, SYM_ALLOW_ARCHETYPE, ALPHA_UC_ID}
line 41:11 mismatched input 'id8' expecting ALPHA_LC_ID
line 42:3 mismatched input 'ELEMENT' expecting SYM_MATCHES
line 42:11 mismatched input 'id0.3' expecting ALPHA_LC_ID
line 43:4 mismatched input 'value' expecting '['
line 47:2 mismatched input '}' expecting SYM_TERMINOLOGY
definition
OBSERVATION[id1.1] matches { -- specialisation containing ordered extension nodes
/data/events[id3]/data/items matches {
after [id1000]
ELEMENT[id0.1] matches { -- Text field 2
value matches {
DV_TEXT[id0.1] -- Text field 2
}
}
ELEMENT[id0.2] matches { -- Quantity 2
value matches {
DV_QUANTITY[id0.2] -- Quantity 2
}
}
before [id8]
ELEMENT[id0.3] matches { -- Text field 3
value matches {
DV_TEXT[id0.3] -- Text field 3
}
}
}
}
the parser reports an ambiguity with the following DFA:
s0-'/'->s1
s1-ALPHA_LC_ID->s2
s2-'/'->s4
s2-SYM_MATCHES->:s3=>1
s4-ALPHA_LC_ID->:s5^=>1
alternatives 1 and 2 are ambiguous.
Then instead of outputting one attribute_def, it outputs two:
/path
and
/element matches {"value"}
The base patterns file here: https://github.com/openEHR/adl-antlr/blob/master/src/main/antlr/base_patterns.g4 which is used in odin grammar defines comment tokens as follows:
H_CMT_LINE : '--------' '-'? '\n' ; // special type of comment for splitting template overlays
CMT_LINE : '--' .? '\n' -> skip ; // (increment line count)
This introduces a problem with antlr 4.7, but interestingly, 4.5 manages to parse input, which I think should not be possible: probably Antlr's smart parsing algorithms are relaxing the rules a bit too much.
Since Antlr considers the declaration of tokens in the grammar files during parsing, the H_CMT_LINE token overrides the CMT_LINE token in the following examples, taken from openEHR RM bmm (CIMI bmms have same types of comments too):
------------------------------------------------------
-- BMM version on which these schemas are based.
------------------------------------------------------
bmm_version = <"2.1">
------------------------------------------------------
-- schema identification
-- (schema_id computed as <rm_publisher>_<schema_name>_<rm_release>)
------------------------------------------------------
The long comments above are meant to be comments and they don't imply the special template overlay comment. However, based on the grammar file above, the lexer is bound to match these long comment lines to H_CMT_LINE which I'd say is the correct behaviour. The result, using antlr 4.7 (the latest version) is that the parse tree never goes beyond the long comment lines.
Changing the order of CMT_LINE and H_CMT_LINE fixes the issue but I'm sure it'd break the behaviour within archetype files where template overlays are expressed.
I'd suggest changing the template overlay comment token to something else. Maybe something like:
H_CMT_LINE : '-/' ''*? '\n' ; // special type of comment for splitting template overlays
It is better to have items which are needed in AOM parsing in the Parser-section instead of the Lexer. This makes the AOM more safe to write and easier to maintain. The difference between the Parser and the Lexer I am referring to is that the Parser creates classes to define the grammar, these classes are easier to query.
Take a look at chapter 5.6 in the Antlr book, there is a list on page 81/82, and I refer to the last item. On page 83 is an example which explains (kind of) this issue also.
So instead of
ARCHETYPE_HRID : ARCHETYPE_HRID_ROOT '.v' VERSION_ID ;
ARCHETYPE_REF : ARCHETYPE_HRID_ROOT '.v' INTEGER ( '.' DIGIT+ )* ;
fragment ARCHETYPE_HRID_ROOT : (NAMESPACE '::')? IDENTIFIER '-' IDENTIFIER '-' IDENTIFIER '.' LABEL ; ..................
It should become something like:
archetype_hrid: archetype_hrid_root '.v' VERSION_ID ;
archetype_hrid_root: (namespace '::')? IDENTIFIER '-' IDENTIFIER '-' IDENTIFIER '.' LABEL ;
namespace : LABEL ('.' LABEL)+ ;
(needs to be worked out further)
This leads to parser errors, for example the specializes section needs an archetype reference, not a HRID, but the token stream supplies a HRID, leading to mismatched input errors:
line 5:4 mismatched input 'openEHR-EHR-COMPOSITION.report_result.v1.0.0' expecting ARCHETYPE_REF
With the following ADL-fragment:
template (adl_version=2.0.5; rm_release=1.0.2; generated)
openEHR-EHR-COMPOSITION.blood_pressure.v1.0.0
specialize
openEHR-EHR-COMPOSITION.report_result.v1.0.0
You write in JIRA: "You can only have an assumed value in the ac_code"
Your solution does work, I did some testing to simplify it, but I found it is the best possible solution, because of the rule that only an AC_CODE can have an AT_CODE as assumedValue.
Well done. Case closed ;-)
I'm using .NET Core 6 with antlr-4.10.1-complete.jar as a parser/lexer generator and with the Antlr4.Runtime.Standard package for runtime.
Got some syntax errors trying to parse the openEHR-EHR-SECTION.problem_list.v0.adl
archetype (taken from CKM):
line 1:23 mismatched input '1.4' expecting VERSION_ID
line 5:2 mismatched input 'at0000' expecting AT_CODE
Here is a project to reproduce - https://github.com/da-baranov/openehr-sandbox/tree/main/openehr-antlr-test
public static void Main(string[] args)
{
var fileContent = File.ReadAllText(@"openEHR-EHR-SECTION.problem_list.v0.adl", Encoding.UTF8);
var inputStream = new AntlrInputStream(fileContent);
var lexer = new Lexer(inputStream);
var tokenStream = new CommonTokenStream(lexer);
var parser = new Parser(tokenStream);
var errorListener = new TestErrorListener();
var listener = new TestListener();
parser.AddParseListener(listener);
parser.AddErrorListener(errorListener);
parser.adl14_archetype();
}
In the spec it says in the metadata section
| SYM_UID '=' ( DOTTED_NUMERIC | GUID ) // TODO: this won't match all Oids properly
In the grammar it says GUID only, so the dotted_numeric syntax does not, parse, for example
openehr-TEST_PKG-WHOLE.child_with_oid.v1.0.0
archetype (adl_version=2.0.5; rm_release=1.0.2; uid=2.4.34.666.7.2)
openehr-TEST_PKG-WHOLE.child_with_oid.v1.0.0
trying to parse adl2-tests/features/terminology/value_sets/openehr-ehr-EVALUATION.term_constraint_variations.v0.0.1.adls
line 34:27 mismatched input '}' expecting '['
line 186:22 mismatched input 'nomedct.info' expecting '>'
line 186:37 mismatched input '/' expecting '='
when trying to parse
ELEMENT[id5] occurrences matches {0..1} matches { -- Absolute Contraindication?
value matches {
DV_BOOLEAN[id52] matches {
value matches {True}
}
}
}
The current syntax for c_integer is:
c_integer: ( integer_value | integer_list_value | integer_interval_value | integer_interval_list_value ) ( ';' integer_value )? ;
This makes it rather annoying to parse in code, because you get just an array of integer values and you have to check the semicolon to see how to tree it.
it could be changed in:
c_integer: ( integer_value | integer_list_value | integer_interval_value | integer_interval_list_value ) ( ';' assumed_integer_value )? ;
assumed_integer_value: integer_value;
Which would make walking this tree much easier.
Many existing archetype-ids contain '-' marks in the concept id. Those cannot currently be parsed with this grammar, the lexer does not parse an ARCHETYPE_HRID in that case
Token stream logs:
[@2410,39246:39257='"Comentario"',<96>,596:24]
[@2411,39258:39258='>',<19>,596:36]
[@2412,39276:39286='description',<91>,597:16]
[@2413,39288:39288='=',<4>,597:28]
[@2414,39290:39290='<',<17>,597:30]
[@2415,39291:39341='"Comentario sobre la medida de la presión arterial"',<96>,597:31]
[@2416,39342:39342='>',<19>,597:82]
[@2417,39356:39356='>',<19>,598:12]
[@2418,39366:39366='>',<19>,599:8]
[@2419,39372:39372='>',<19>,600:4]
[@2420,39458:39473='template_overlay',<35>,604:0]
[@2421,39475:39475='(',<1>,604:17]
[@2422,39476:39486='adl_version',<46>,604:18]
While the input is:
>
["sl"] = <
>
["es-ar"] = <
>
>
---------------------------------------------------------------------------------
template_overlay (adl_version=2.0.5; rm_release=1.0.2; generated)
Because the list values have been fully specified like, they are a bit inconvenient to parse:
primitive_list_value :
string_list_value
| integer_list_value
| real_list_value
| boolean_list_value
| character_list_value
| term_code_list_value
| date_list_value
| time_list_value
| date_time_list_value
| duration_list_value
;
primitive_interval_value :
integer_interval_value
| real_interval_value
| date_interval_value
| time_interval_value
| date_time_interval_value
| duration_interval_value
;
A similar pattern applies to cadl. But it leads to the following copy/paste code - and the following file is just for real and integer numbers, the same applies to all other types:
So it is quite a lot of repetitive copy/paste code to handle these constructs
You could do instead:
primitive_list_value :
primitive_value ( ( ',' primitive_value )+ | ',' SYM_LIST_CONTINUE ) ;
primitive_interval_value:
'|' SYM_GT? primitive_value SYM_INTERVAL_SEP SYM_LT? primitive_value
'|' relop primitive_value
This allows for reuse of code both in java and the grammar.
it has the drawback of having a bit less validation in the ANTLR-grammar, and needing a bit more validation in the tree walker/listener. But this validation is rather simple:
Since @BertVerhees asked for things to do, a list of suggestions of things he can do:
In archie (https://github.com/nedap/archie) :
The NAMESPACE token is ambiguous with ARCHETYPE_HRID, so sometimes a HRID matches as a NAMESPACE - result, parse error.
Making namespace a fragment (it can only be used in one place, so that sounds right) solves this issue
duration_attr20 matches {Pw/PT0S}
duration_attr21 matches {Pmw/PT0S}
duration_attr22 matches {PWD/PT0S}
Matches
/PT0S}
duration_attr21 matches {Pmw/
as a regexp and breaks parsing with a MisMatchedInputException later on.
Possible solution: the following lexer rule:
CONTAINED_REGEXP: '{'WS* (SLASH_REGEXP | CARET_REGEXP) WS* (';' WS* STRING)? WS* '}';
fragment SLASH_REGEXP: '/' SLASH_REGEXP_CHAR+ '/';
fragment SLASH_REGEXP_CHAR: ~[/\n\r] | ESCAPE_SEQ | '\\/';
fragment CARET_REGEXP: '^' CARET_REGEXP_CHAR+ '^';
fragment CARET_REGEXP_CHAR: ~[^\n\r] | ESCAPE_SEQ | '\\^';
I give following command, in my specialized test (excuse the camelcase)
grun Test archetypeSlot -gui
Then I give this as argument:
allow_archetype OBSERVATION[id2] occurrences matches {0..1} matches { -- Vital signs
include
domain_concept/test matches {/blood_pressure.v2/}
domain_concept matches {/.*/}
}
It gives an error:
line 4:43 no viable alternative at input 'domain_conceptmatches'
If I change the input like this:
allow_archetype OBSERVATION[id2] occurrences matches {0..1} matches { -- Vital signs
include
domain_concept/test matches {/blood_pressure.v2/}
domain_concept/test matches {/.*/}
}
I get no error.
It is because of the definition of adl_relative_path which requires a slash somewhere in the string, else it cannot make a valid argument of it. This means that a relative path, consisting of one segment can never be valid.
This will not be a big problem in real life, it will not often happen that a relativepath without slash is used, but it is a weakness of the grammar.
URLs in odin_text seem to not get parsed correctly with this grammar:
trying to parse adl2-tests/features/aom_structures/rules/openEHR-EHR-ADMIN_ENTRY.dependency_rule.v1.adls
line 238:23 mismatched input 'penehr.org' expecting '>'
line 238:36 mismatched input '/' expecting '='
line 239:36 mismatched input '/' expecting '='
line 240:36 mismatched input '/' expecting '='
line 241:36 mismatched input '/' expecting '='
Content
term_bindings = <
["openehr"] = <
["at37"] = <http://openehr.org/id/124>
["at38"] = <http://openehr.org/id/425>
["at39"] = <http://openehr.org/id/427>
["at40"] = <http://openehr.org/id/429>
>
trying to parse adl2-tests/features/description/text/openEHR-TEST_PKG-ENTRY.long_lines.v1.adls
line 21:10 no viable alternative at input '<"'
line 21:16 mismatched input 'is' expecting '='
line 21:19 mismatched input 'what' expecting '=```
... many more lines here
when trying to parse
language = <[ISO_639-1::en]>
purpose = <"what this is for">
use = <"this is what it is used for, this could be a really long line or even
multiple lines, just like
what you are reading now">
See what i changed to the grammars in nedap/archie@538415e
It is rather inconvenient to parse the '<'-characters without having them as named tokens. It would be a good idea to name them.
This applies to the following symbols:
;
<
<=
There will probably be more situations when i parse more of the grammar.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.