melt-umn / copper

An integrated context-aware scanner and parser generator

Home Page: http://melt.cs.umn.edu/copper

License: GNU Lesser General Public License v3.0

HTML 0.04% Java 97.29% Logos 1.08% XSLT 1.58%

java parsing parser-generator

copper's Issues

Adding copper to maven central

Hey everyone,
We (namely @krame505 and I) were thinking about adding copper to maven central.
The process to get this done is not hugely complex, but there are a few steps that require some extra info from everyone.

To get the project published, we need a txt record to show that we are in control of the domain, for which I think we just need to talk to CSE-IT.
- The txt record points to a Jira ticket. I made a personal account, but in the process of creating the ticket it also wants a list of others who would want to contribute. I talked it over with Lucas and Eric and we decided that individual accounts for everyone would be enough, so everyone would need to create an account here and give me their usernames.

There are a few other steps (like updating the pom.xml, generating the PGP keys, etc.) but I think the list of usernames is the only thing that I can't just do myself.

Let me know if anyone has any further thoughts!

Significant memory usage regression with #59 on large grammars

To test out the new features in #59 I tried commenting out the associativity and precedence for Or_t in the Silver compiler:

terminal Or_t          '||' lexer classes {OP};--, precedence = 5, association = left;

./self-compile now just hangs at parser generation with high memory usage and does not print anything, even if I bump up the heap size by 10 GB. It eventually crashes with a "java heap space" error.

This behavior only happens when there is a parser conflict; Copper functions normally with an unambiguous grammar.

Layout on extension productions can cause unexpected parse errors in host code

Some unexpected behavior was discovered when experimenting with layout on an ableC extension for prolog-style logic programming. The following is a somewhat simplified grammar that seems to exhibit the same issue.

Grammar "host":

ignore terminal WhiteSpace_t /[\t\r\n ]+/;

terminal Plus_t '+' association=left, precedence=0;
terminal Mod_t '%' association=left, precedence=1;

terminal LParen_t '(';
terminal RParen_t ')';
terminal LCurly_t '{';
terminal RCurly_t '}';
terminal Semi_t ';';

terminal Id_t /[a-zA-Z]+/;

nonterminal Stmt;
concrete productions top::Stmt
| Expr ';' {}
| '{' Stmt '}' {}

nonterminal Expr;
concrete productions top::Expr
| '(' Expr ')' {}
| Expr '+' Expr {}
| Expr '%' Expr {}
| Id_c '(' ')' {}
| Id_c {}

nonterminal Id_c;
concrete productions top::Id_c
| Id_t {}

Grammar "ext":

terminal ExtComment_t /% .*/;

marking terminal Ext_t 'ext' dominates Id_t;
terminal Dot_t '.';

concrete production extProd
top::Stmt ::= 'ext' '{' id::Id_c '(' ')' '.' '}'
layout { ExtComment_t }
{}

Using a parser built only from "host" the string

{
  a % b;
}

parses successfully. However using a parser containing both host and ext (generated copper spec for reference: Parser_copper_features_test_layout_lookahead_parse_ext.copper) for the same string, the following parse error results:

   Error at line 3, column 0 in file 
         (parser state: 3; real character index: 12):
  Expected a token of one of the following types:
   [copper_features:test_layout:lookahead:ext:ExtComment_t,
    '(',
    '%',
    '+',
    ')',
    ';',
    copper_features:test_layout:lookahead:host:WhiteSpace_t]
   Input currently matches:
   ['}']

This is a rather unexpected result because the introduction of an extension causes unrelated, existing code to suddenly break, without any sort of lexical ambiguity being raised. If I correctly understand what is going on here, this happens because DFA states differing only in lookahead are merged, resulting in the layout terminal dominating due to maximal munch?

This behavior seems rather undesirable, and at least should emit some sort of warning. Or would it even be possible to modify the LALR(1) parser construction algorithm to not merge states that have different layout?

XML Schema forbids grammars with multiple bridge productions

The BridgeProductions Schema portion demands a sequence of a single bridge production for a grammar, despite copper containing logic to handle multiple bridge productions in a grammar.

I don't know if there is a reason not to, but if there is not, a simple change to that schema would enable grammars with multiple bridge productions:

<xs:complexType>
	<xs:sequence>
		<xs:element name="ProductionRef" type="ProductionRef" minOccurs="1" maxOccurs="unbounded"/>
	</xs:sequence>
</xs:complexType>

Make dumps programmatically accessible.

Currently, Copper produces parser dumps by passing the entire internal representation of the grammar, LR DFA, and parse table into a "dumper" class for processing. This makes it difficult to extract any intermediate grammar information programmatically, such as is required for resolving issue #22.

This could be resolved by making use of the new schema for XML dumps (see issue #25), or by making the "dumpers" into event handlers, similar to how logging is currently handled. The latter would have the advantage of not circumscribing what is allowed to appear in a dump, so we could support different kinds of dumps for alternate pipelines.

Error for empty ranges in regular expressions

The regular expression /[A-Za-Z]/ is obviously intended to capture both capital and lowercase letters, but doesn't. It looks like a Java error occurs if we have just /[a-Z]/ but there should probably be an error message given if there are no characters in a range. Because there are no characters between a and Z, we should generate an error message in both cases.
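
A minimal sketch of the kind of check being asked for, assuming a range is available as a pair of endpoint characters when the regex is processed. The class and method names are made up for illustration and do not correspond to anything in Copper.

```java
import java.util.Optional;

final class CharRangeCheck {
    /** Returns an error message if the range lo-hi contains no characters. */
    static Optional<String> validate(char lo, char hi) {
        if (lo > hi) {
            return Optional.of(String.format(
                "Empty character range '%c-%c': U+%04X comes after U+%04X",
                lo, hi, (int) lo, (int) hi));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // 'a' is U+0061 and 'Z' is U+005A, so a-Z is empty.
        System.out.println(validate('a', 'Z').orElse("ok"));
        System.out.println(validate('A', 'Z').orElse("ok"));
    }
}
```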

Lexical ambiguities on marking terminals should generate helpful message.

The error message for a lexical ambiguity on marking terminals is displayed just like any other kind of lexical ambiguity. But this is the one error message that a programmer might see when composing language extensions. This should be a special message with some instructions on what it is and how to fix it, perhaps containing a URL to good instructions and an explanation.

Disambiguation functions are free to return a terminal not part of the set they are disambiguating

While looking at an issue on Silver, I discovered that Copper does not seem to validate that the terminal returned from a disambiguation function is one of the ones that the function ought to disambiguate between. For example, the Silver code

terminal Id_t /[a-zA-Z][a-zA-Z0-9]*/;
terminal IntLit_t /[0-9]+/;
terminal If_t /if/;
ignore terminal WhiteSpace_t /[\t\r\n\ ]+/  ;

disambiguate Id_t, If_t {
  pluck IntLit_t;
}

produces a parser that will parse "if" as an IntLit_t. Complete example as a Gist.

I have a patch I can PR against Copper to add such a check to the site in SingleDFAEngine where we call runDisambiguationAction. Does it belong there, or in the code we generate for runDisambiguationAction/disambiguate_XXX?
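
For illustration, the check described above could look roughly like the following. This is only a sketch: the class and method names are hypothetical and are not taken from Copper's source, and it assumes the ambiguous terminals are available as a java.util.BitSet of terminal indices.

```java
import java.util.BitSet;

final class DisambiguationCheck {
    /**
     * Hypothetical guard: returns the chosen terminal index unchanged if it is
     * a member of the ambiguous set, and throws otherwise.
     */
    static int checkedResult(BitSet ambiguous, int chosen) {
        // BitSet.get rejects negative indices, so test for them first.
        if (chosen < 0 || !ambiguous.get(chosen)) {
            throw new IllegalStateException(
                "Disambiguation function returned terminal " + chosen
                + ", which is not in the ambiguous set " + ambiguous);
        }
        return chosen;
    }
}
```

Whether such a guard is called from the engine or emitted into the generated code is exactly the open question above; the sketch works the same way in either place.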

Separate disambiguation functions from disambiguation groups in XML dump.

Currently, the XML- and HTML-format dumps make no distinction between disambiguation functions that disambiguate non-declaratively and disambiguation groups that always disambiguate to the same terminal. This should be added as part of the resolution of ticket #25.

Multiple starting nonterminals in parser?

How difficult would this be to support on Copper's side?

I have in mind amending Copper specs to support something like:

    <StartSymbol>
      <NonterminalRef id="silver_definition_core_Root" grammar="host" />
    </StartSymbol>
    <AlternativeStarts>
      <AlternativeStart id="parseExpr">
        <NonterminalRef id="silver_definition_core_Expr" grammar="host" />
      </AlternativeStart>
      ...
    </AlternativeStarts>

The result being generating additional methods like parseExpr in the generated parser class, beside the normally generated one for the normal start symbol.

On the Silver side, to generate a spec like this, we could allow a syntax like:

parser parseC :: Root {
  grammar:a;
  grammar:b;
  parser parseExpr :: Expr;
  parser parseDecl :: Decl;
}

And thus avoid re-compiling the same grammar over and over to get these extra parsers.

...Would this impact anything like the modular determinism analysis? (My guess is not really, because it's just another state (or few) in the original parser, which are just analyzed normally...)

Grammar railroad diagram

Would be nice if copper could also generate an EBNF as understood by https://www.bottlecaps.de/rr/ui to generate railroad diagrams (https://en.wikipedia.org/wiki/Syntax_diagram).

I extended bison, byacc, lemon and btyacc to do so; the results can be seen here https://github.com/mingodad/lalr-parser-test , also CocoR here https://github.com/mingodad/CocoR-Java , unicc here https://github.com/mingodad/unicc , and peg/leg here https://github.com/mingodad/peg .

Would be nice to have it output a consolidated EBNF to have a full global view of the final grammar because indirect usage of copper, silver and ableC uses several pieces to compose the final grammar.

Allow shiftable set to be accessed from disambiguation functions applicable to subsets

Currently in disambiguation functions that are applicable to subsets (we really need to come up with a shorter name...) there is no way of knowing what subset is actually being disambiguated, so a particular lexeme must always disambiguate to the same terminal regardless of what is permitted, thus effectively losing some of the benefits of context awareness in parsing.
For example in ableC, we are using a disambiguation function applicable to subsets for identifier-like terminals (e.g. types and templated names.) In ambiguities where e.g. a regular identifier or template identifier is permitted, we can check a context parser attribute to find that the name has been previously defined as a template function, perhaps as a forward declaration. However, later on in the actual definition of the function, the same name appears in an ambiguity between an identifier and a type name. Since the name isn't a type, and a template name isn't allowed at this point, the proper thing to do here is to default to parsing the name as an identifier, but since we don't have access in the disambiguation class to the shiftable set, this isn't possible.
I propose making the shiftable bit set from match.terms available as a new parameter to functions generated as disambiguation functions applicable to subsets. Thoughts?
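
To make the proposal concrete, here is a rough sketch of what a disambiguation function with access to the shiftable set might do in the ableC scenario. Everything here is an assumption for illustration: the terminal indices, the two name sets standing in for the context parser attribute, and the method signature are hypothetical and not Copper's generated code.

```java
import java.util.BitSet;
import java.util.Set;

final class SubsetDisambiguation {
    // Hypothetical terminal indices.
    static final int IDENTIFIER = 0;
    static final int TYPE_NAME = 1;
    static final int TEMPLATE_NAME = 2;

    /**
     * Picks a terminal for the lexeme, but only one the parser can shift.
     * Falls back to IDENTIFIER when the preferred terminal is not shiftable.
     */
    static int disambiguate(String lexeme, BitSet shiftable,
                            Set<String> typeNames, Set<String> templateNames) {
        if (typeNames.contains(lexeme) && shiftable.get(TYPE_NAME)) {
            return TYPE_NAME;
        }
        if (templateNames.contains(lexeme) && shiftable.get(TEMPLATE_NAME)) {
            return TEMPLATE_NAME;
        }
        return IDENTIFIER;
    }
}
```

With this shape, a name declared as a template resolves to the template terminal where that is shiftable, and to a plain identifier in an identifier/type-name ambiguity, which is the behavior described above.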

XML SAX parser warnings

If there is a lexical ambiguity (and maybe in other situations as well), Copper prints the following warning:

[copper] Warning: org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser: Property 'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not recognized.
[copper] Compiler warnings:
[copper] WARNING: 'org.apache.xerces.jaxp.SAXParserImpl: Property 'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.'

It doesn't seem to cause any problems, but we should still fix it.

Use Maven to build Copper.

Copper currently relies on a hard-wired pathname to handle its dependency on ANT (for building its ANT task) and, for the same reason, cannot easily make use of JUnit or another unit-test framework.

I believe the dependencies can be better handled by using Maven instead of ANT to build Copper. As I understand the process, this would involve splitting the current codebase into five bundles: runtime, compile time, ANT task, test suite, and an "aggregate" bundle to bring everything together.

Switching to Maven will necessitate some changes to the automatic builds of Copper currently being run on the MELT system, but should not result in any substantial change to the final JARs.

This should be considered as blocking issues #11 and #12.

Convert Copper user manual to Markdown.

Copper's user manual, being written in LaTeX, is not available in "online" form, but only as a downloadable PDF.

As far as I know there is nothing within the manual preventing its straightforward conversion into Markdown, which would allow it to be displayed online without eliminating the possibility of a PDF version.

Copper hangs when compiling a simple grammar

See title, Copper didn't terminate when trying to compile the attached .copper file
Parser_edu_umn_cs_melt_metaocaml_driver_parse.copper.txt
(renamed so github lets me upload it).
This corresponds to the Silver grammar


terminal Identifier_t /[A-Za-z_\$][A-Za-z_0-9\$]*/;
terminal LParen_t '(';
terminal RParen_t ')';

terminal App_t '' association = left, precedence = 10;

nonterminal Expr_c with ast<Expr>, location;

concrete productions top::Expr_c
| e1::Expr_c e2::Expr_c
  operator=App_t
  { top.ast = appExpr(e1.ast, e2.ast, location=top.location); }
| id::Identifier_t
  { top.ast = varExpr(id.lexeme, location=top.location); }

Also, as a side question, does Copper support setting the associativity of a production lacking terminals using the <Operator> field? It seems like it should but the behavior I am seeing seems to indicate not.

Better management of empty productions

Hi, I'm asking for feedback on a new feature/behavior that I would like to add to Copper, if I'm able. I'm not an expert on LALR parsers, so please tell me if this is a stupid idea.

In my grammar I have a lot of productions with empty rules like this:

nonterminal Stmt;
concrete productions r::Stmt
| 'if' c::Condition 'then' s::Stmt else::OptionalElse { }
| ... other cases ...

nonterminal OptionalElse;
concrete productions r::OptionalElse
| 'else' s::Stmt { }
| { }

With two or more consecutive empty rules, the parser signals a shift/reduce conflict, because it does not know whether the empty string is assigned to the inner or the outer if.

In all my rules, I want the meaning of a production with an explicit empty rule (but not of shift/reduce conflicts in general) to be "if there is no match, match the empty rule, and then continue parsing the input string from left to right."

I suspect that this feature is implementable, because the parser can tell whether a production is matching or not without backtracking. It is still an LALR(1) parser, and the test is only on one production, so the lookahead suffices.

I like this approach because it makes an LALR parser (IMHO) more user-friendly, without losing the benefit that LALR parsers have over top-down parsers: static checking of grammars. In particular, this behavior makes it possible to handle productions with nested forms like A?, A*, A+, because they are all reducible to productions with a default empty rule.

Please tell me if I'm wrong or naive at some point in this reasoning. I'm not an expert in the field of parsers, and it is better for me to receive some feedback/advice before working on it.

Thanks in any case, and many thanks for the good work with Copper.
Massimo

Make new Copper release.

With the addition of Kevin's new parser compilation pipelines and other features to Copper, a version number bump is in order.

Issues #8 and #10 should be resolved first, so I am assigning to myself to start.

Once these issues are closed, the release candidate should be put through some more aggressive testing, since many of the changes since Copper 0.7.2 could potentially destabilize the default pipeline. Theoretically, all changes in behavior would be outside the main pipeline, so it should suffice to swap it into Silver in the place of 0.7.2 and see if anything breaks.

Unresolvable lexical ambiguities between ignore terminals not caught during compilation

For example, the following minimal Silver grammar:

grammar ignore_conflict;

ignore terminal Foo1 'foo';
ignore terminal Foo2 'foo';

nonterminal Root;

concrete production root
top::Root ::=
{}

parser parse::Root {
  ignore_conflict;
}

function main 
IOVal<Integer> ::= args::[String] ioin::IO
{
  return ioval(print(parse("foo", "test").parseErrors, ioin), 0);
}

reports "No lexical ambiguities detected" when compiled, but produces the following error when run:

Error at line 1, column 0 in file test
         (parser state: 1; real character index: 0):
  Lexical ambiguity between tokens:
   ['foo',
    'foo']

For reference, here is the generated .copper file: Parser_ignore_conflict_parse.copper

Verify all production signatures are unique.

A TODO in GrammarConsistencyChecker.java suggests that Copper makes no check for duplicate productions.
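
As a rough idea of what such a check involves, the sketch below finds repeated signatures in a list of productions. It assumes, purely for illustration, that each production is represented as a list of symbol names with the left-hand side first; Copper's actual grammar representation is different, and the class name is hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DuplicateProductionCheck {
    /**
     * Returns every signature (LHS followed by RHS symbols) that occurs more
     * than once, in order of first repetition.
     */
    static Set<List<String>> findDuplicates(List<List<String>> productions) {
        Set<List<String>> seen = new HashSet<>();
        Set<List<String>> duplicates = new LinkedHashSet<>();
        for (List<String> signature : productions) {
            // List.equals and List.hashCode compare element by element.
            if (!seen.add(signature)) {
                duplicates.add(signature);
            }
        }
        return duplicates;
    }
}
```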

Subtle change to dominates relationship between terminals

Another issue I recalled and wanted to write down before it's forgotten.

Right now we have a very small problem with the Silver parser:

terminal Foo 'foo' lexerclasses {KEYWORD};

This ought to be a parse error. The correct syntax is lexer classes as two words. This code is accepted because Copper narrows the set of acceptable terminals down to just some keywords, without identifiers in the valid set. As a result, the (expected?) behavior of recognizing this as one big keyword fails to happen, and instead it sees things as two keywords with no whitespace.

I dug up an old email chain about this, August had these comments as a potential solution:

Currently, the precedence relation "T submits-to U" prevents T from matching a string (e.g., lexe) in the presence of a longer match to U (e.g., lexer) whether or not U is in the valid lookahead set. We could also bring in a reciprocal constraint -- preventing U from matching in the presence of a longer match to T (e.g., lexerclasses) whether or not T is in the valid lookahead set.

The current precedence relations do the equivalent of bringing into the valid lookahead set all terminals of higher precedence than those already there. After this change, they would do what you suggested -- bringing in all terminals of lower precedence as well.

From the theoretical standpoint, this would bring the behavior of the precedence relation into greater conformance with that of its traditional-scanner counterpart, without disrupting context-awareness any more than currently -- the scanner would continue to behave within the strictures of the context-aware scanner "contract."

Correct me if I'm wrong, but I don't think we ever did this.

Get Copper's XML I/O in order.

Currently, Copper's use of XML for input and output (viz., the XML input skin and the XML dump format) is not properly specified or documented. The main shortcomings are:

- The output XML dump format is not documented. There is an XSLT stylesheet to transform an XML dump into an HTML dump, but there is no schema or object model for the former.
- The namespace on the input XML skin contains no version information. It would therefore be very difficult to maintain any kind of backward compatibility over schema changes.

These issues should be resolved as follows:

- Work out a convention for all Copper-related namespaces. This convention should ideally allow for version updates to any schema associated with Copper, independent of any other schema or of the Copper version.
- Create a schema for the XML dump format, with a new namespace conforming to this convention.
- Change the namespace on the XML skin schema to conform to this convention.

The change of namespace in the XML skin will require corresponding changes in Silver and in the test cases.

SlidingWindowScannerBuffer building a new string for every character read?

A TODO in SlidingWindowScannerBuffer.java suggests that the buffer is inefficient and building a new string object for every character read.

We should either verify that this is not happening or find a way to eliminate it.
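
For comparison, here is a sketch of a sliding-window buffer that does not allocate a string per character. It is not based on the actual SlidingWindowScannerBuffer code; the class and all of its methods are hypothetical. Characters are read in chunks into one StringBuilder, and a String is only built when a lexeme is requested.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class WindowBuffer {
    static final int EOF = -1;

    private final Reader in;
    private final StringBuilder window = new StringBuilder();
    private final char[] chunk = new char[4096];
    private int base = 0; // absolute index of window.charAt(0)

    WindowBuffer(Reader in) {
        this.in = in;
    }

    /** Returns the character at an absolute index, or EOF past the end of input. */
    int charAt(int index) {
        if (index < base) {
            throw new IndexOutOfBoundsException("index " + index + " was already discarded");
        }
        try {
            // Refill in chunks; no per-character String is created.
            while (index - base >= window.length()) {
                int n = in.read(chunk);
                if (n < 0) {
                    return EOF;
                }
                window.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return window.charAt(index - base);
    }

    /** Builds a String only when a lexeme is actually needed. */
    String lexeme(int start, int end) {
        return window.substring(start - base, end - base);
    }

    /** Slides the window forward so that it starts at the given absolute index. */
    void discardBefore(int index) {
        int n = Math.min(Math.max(index - base, 0), window.length());
        window.delete(0, n);
        base += n;
    }
}
```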

Allow number of items per line to vary when pretty printing.

Currently, the pretty-printing methods responsible for printing lists of grammar symbols for Copper's error messages [1] have the caller specify a fixed number of items per line.

This should be changed to allow the caller to specify the maximum number of characters per line, as well as whether the items must be specified in order, allowing the number of items per line to vary and an optimal use made of space.
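
A small sketch of the in-order variant, assuming the caller passes the symbol names and a maximum line width. The class name is hypothetical, and the reordering variant (packing items out of order for a tighter fit) is not shown.

```java
import java.util.ArrayList;
import java.util.List;

final class SymbolListWrapper {
    /**
     * Greedily packs comma-separated items into lines of at most maxWidth
     * characters, preserving order. An item longer than maxWidth gets a
     * line of its own.
     */
    static List<String> wrap(List<String> items, int maxWidth) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < items.size(); i++) {
            // Every item except the last carries a trailing comma.
            String piece = items.get(i) + (i < items.size() - 1 ? "," : "");
            if (current.length() > 0
                    && current.length() + 1 + piece.length() > maxWidth) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(piece);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```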

Disambiguation classes

This is a continuation of melt-umn/silver#193.
We essentially want to add a mechanism similar to disambiguation functions, except as an open set of ambiguous terminals where we don't know, at the declaration, everything that may possibly be a member of the class. A disambiguation class should also apply for any ambiguity that is a subset of its terminals, as opposed to a fixed set of terminals. Since the logic to select a member of the class remains the same no matter what terminals are actually in the set of potential matches, it may happen that a terminal is selected by the class that is not a potential match - in this case, a syntax error should be raised.
Some implementation considerations:

We are considering implementing disambiguation classes as an optional component of terminal classes, containing a piece of code to run indicating which terminal should be selected. The copper compiler would treat any ambiguity where all terminals are members of the same terminal class containing a disambiguation expression as resolved.
The code corresponding to a disambiguation class would typically involve looking up the lexeme in some sort of environment to find which terminal to match, similar to the lexer hack in C (which itself would be re-implemented as a disambiguation class). This means that terminals must be identified in some way outside of the disambiguation class code, so the disambiguation function approach of generating an identifier corresponding to the Copper name of each terminal is not an option. One possibility is to have the code evaluate to a String containing the Copper name of the chosen terminal, which would then be looked up in a generated table to find the actual BitSet index of the terminal. This may have performance issues, but I can't think of a better approach (such as querying the information from the parser at startup), since the code constructing the disambiguation environment would be totally independent from Copper and could potentially be used in multiple parsers generated from the same grammar.
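
The name-to-index lookup mentioned above could be sketched as follows. The table class, its constructor, and the exception choices are assumptions for illustration, not a description of what Copper generates; the point is only that the lookup is one hash-map access per disambiguation, followed by the membership test that turns a non-matching choice into an error.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

final class TerminalNameTable {
    private final Map<String, Integer> indexByName = new HashMap<>();

    /** Assigns indices 0, 1, 2, ... to the given terminal names in order. */
    TerminalNameTable(String... names) {
        for (int i = 0; i < names.length; i++) {
            indexByName.put(names[i], i);
        }
    }

    /**
     * Maps the name returned by disambiguation code to a terminal index,
     * rejecting names that are unknown or not among the current matches.
     */
    int resolve(String name, BitSet matches) {
        Integer index = indexByName.get(name);
        if (index == null) {
            throw new IllegalArgumentException("Unknown terminal: " + name);
        }
        if (!matches.get(index)) {
            // The class picked a terminal that is not a potential match here.
            throw new IllegalStateException(
                "Syntax error: " + name + " is not a potential match");
        }
        return index;
    }
}
```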

Any other thoughts/comments?

index and endIndex in ableC

I'm trying to get byte indices within a C file after parsing with ableC, using the Message.index and Message.endIndex attributes; however, it appears that they are relative to the preprocessed source, not the file's source. The line and column numbers are updated in edu/umn/cs/melt/ableC/concretesyntax/cppTags/CPPTags.sv, in an action block. Is it possible to update index and endIndex in the same way, or is that not possible/would cause breakage?

@ericvanwyk is sitting here and he doesn't know either

Getting first set for nonterminals?

If I wanted to have Copper send Silver back the first set for every nonterminal, what would be the place to start? I want this for melt-umn/silver#205, so nonterminals can be included in "alias groups."

Semantic actions on layout terminals are delayed

For context, this problem showed up in ableC, where we are trying to use the C preprocessor tags to control the disambiguation of identifiers, based on whether we are currently parsing a system header file. These preprocessor tags are recognized as a layout terminal, with a semantic action that sets a parser attribute which is later referenced in a disambiguation function.

The following is a simple example in Silver:

grammar layoutactions;

parser attribute isMarked :: Boolean action { isMarked = false; };

ignore terminal Marker_t '#'
  action {
    print "Shifted Marker";
    isMarked = true;
  };

terminal Id1_t /[a-zA-Z]+/ action { print "Shifted Id1"; };
terminal Id2_t /[a-zA-Z]+/ action { print "Shifted Id2"; };

disambiguate Id1_t, Id2_t {
  print "Disambiguating Id";
  pluck if isMarked then Id1_t else Id2_t;
}

nonterminal Foo;
concrete productions top::Foo
| Id1_t {} action { print "Reduced 1"; }
| Id2_t {} action { print "Reduced 2"; }

parser parse :: Foo {  layoutactions; }
function main 
IOVal<Integer> ::= args::[String] ioin::IO
{
  local result::ParseResult<Foo> = parse(head(args), "test");
  return
    if null(args) then ioval(print("no argument\n", ioin), 1)
    else if !result.parseSuccess then ioval(print(result.parseErrors ++ "\n", ioin), 0)
    else ioval(ioin, 0);
}

When run on the input #a, this prints

Disambiguating Id
Shifted Marker
Shifted Id2
Reduced 2

The disambiguation function is being called before the semantic action for '#' has been run, thus giving the wrong result.

As a (very hacky) workaround, one can introduce a second, ambiguous layout terminal and use a disambiguation function to update the parser attribute in the disambiguation function:

ignore terminal Marker2_t '#';
disambiguate Marker_t, Marker2_t {
  print "Disambiguating Marker";
  isMarked = true;
  pluck Marker_t;
}

This gives the desired output of

Disambiguating Marker
Disambiguating Id
Shifted Marker
Shifted Id1
Reduced 1

This workaround does work in the context of ableC but is something that we would greatly wish to avoid.

Since we evidently know which layout terminal is going to be shifted as soon as the disambiguation function is run, why aren't the semantic actions executed immediately, before attempting to disambiguate the next terminal? I vaguely remember there being a reason for this, but I don't remember what it was.

"Remove nonterminals from the 'validLA' sets in the parse table."

TODOs in several places in the codebase (e.g., SingleDFAEngineBuilder.java) say, "Remove nonterminals from the 'validLA' sets in the parse table."

index and endIndex don't count newlines for parser errors

For a simple functional grammar, the input

1 + 2 * 3
thisShouldTriggerASyntaxError

reports the same index and endIndex for the error as the input

1 + 2 * 3



thisShouldTriggerASyntaxError
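
For reference, the offsets one would expect can be computed with a sketch like the one below. It assumes 1-based line numbers, 0-based columns, and '\n' line endings; the class is hypothetical and says nothing about how Copper tracks positions internally.

```java
final class SourceOffsets {
    /**
     * Character offset of (line, column) in text, counting each newline
     * as one character. Lines are 1-based, columns 0-based.
     */
    static int offsetOf(String text, int line, int column) {
        int offset = 0;
        for (int current = 1; current < line; current++) {
            int newline = text.indexOf('\n', offset);
            if (newline < 0) {
                throw new IllegalArgumentException(
                    "line " + line + " is past the end of the input");
            }
            offset = newline + 1; // skip past the newline itself
        }
        return offset + column;
    }
}
```

Under these assumptions the error token starts at offset 10 in the first input and at offset 13 in the second, so the two inputs should not report the same index.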

Follow spillage error message is slightly buggy

This is sort of two issues.

First, the reported spillage seems to be computed incorrectly:

Nonterminal edu:umn:cs:melt:ableC:concretesyntax:Constant_c has follow spillage of
    [edu:umn:cs:melt:ableC:concretesyntax:LCurly_t,
    edu:umn:cs:melt:exts:ableC:tensors:concretesyntax:Cross_product,
    edu:umn:cs:melt:exts:ableC:tensors:concretesyntax:Dot_product,
    edu:umn:cs:melt:exts:ableC:tensors:concretesyntax:Float_dot_product,
    edu:umn:cs:melt:exts:ableC:tensors:concretesyntax:Tensor_multiply]

This first element of the list is the only actual spillage. The other 4 are marking terminals. I hope this is just a small error in the rendering of this error message, and not something more subtle. We fixed just the curly brace, and that made this error go away, at any rate.

Second, when this kind of spillage occurs, we start with a reasonably good error message:

Nonterminal edu:umn:cs:melt:ableC:concretesyntax:Constant_c has follow spillage of

This is useful, since it tells you to go look for where your syntax uses that terminal. But then we follow it up with 147 errors of this sort (for spillage on expressions in ableC):

DFA state 880, item 0 has lookahead spillage of

which are both too numerous and not especially helpful. They don't even refer to states that contain the problematic production; it's seemingly just any state that can reduce to one of the problematic nonterminals.

For the moment, I think we should just eliminate reporting this second kind of error? They aren't adding anything helpful that I can see. Any actual problem should already have an error raised on the nonterminal, no?

Ideally, we'd be able to discover the cause of the spillage and report that production, but I could maybe see how that'd be hard.

Require all operator precedence classes to be explicitly declared.

As a backward-compatibility measure with Copper 0.6, operator precedence classes are currently allowed to be declared implicitly in XML specs as a deprecated practice [1] [2]. This kludge should be removed as part of the resolution of issue #25.

Smarter behavior for 'input currently matches' in parse errors

I remembered and wanted to write down this issue today.

Right now, whenever there is a syntax error in a Silver source file, we get this as part of the parse error:

   Input currently matches:
   [silver:extension:templating:syntax:QuoteWater,
    silver:extension:templating:syntax:SingleLineQuoteWater]

These terminals are pretty much [^"]+ regexes, so naturally maximal-munch prefers them, almost always, over anything else, despite these terminals only being valid in rare contexts. Nearly always, something more specific should be reported.

I'm not sure if there's a perfect solution to this problem. Possibly a simple heuristic would be to resort to using an "all terminals are valid" lex for error reporting ONLY if there are NO matches for the set of terminals valid in the current parser state. But this may not be enough?

I'm not sure if there's any sensible way to try to enlarge the scope of valid terminals without going all the way to recognizing everything.

Document MDA and other advanced Copper features.

Currently, the Copper user manual covers only those features accessible from the CUP skin. Beyond what appears in the Javadoc and my thesis, the modular determinism analysis and other advanced features (parse table composition, etc.) are undocumented.

Attempts to produce this documentation in the past have been stymied by the fact that the features are only accessible through the XML skin and Java API, which were not meant to be used directly. However, if we regard the Javadoc for the grammarbeans package as suitably documenting the API, some usable documentation could be produced.

Create a JUnit test harness for Copper.

Before the new release of Copper, it should be outfitted with its own test harness.

Historically I have avoided creating a test harness based on JUnit to avoid making the Copper build dependent on anything but the Java standard class library and ANT. This independence can be maintained if we move the existing Copper project into a directory Copper and then add a new Eclipse project in a directory CopperTests, with dependency links to the original project and to JUnit.

We can start with a set of regression tests that build each of the existing test grammars and then try to compile the resulting parsers. Later, runtime tests should be added as well, with at least one valid and one invalid input for each parser.

Ability to add interfaces to generated parsers

Right now we can insert code into the parser class via ClassAuxiliaryCode in a Copper spec. We'd like to also be able to add interfaces to the generated class as well, to make it easier to make use of that generated aux code.

I checked like so:

$ grep "class \" + parserName" src -R
src/edu/umn/cs/melt/copper/compiletime/srcbuilders/single/SingleDFAEngineBuilder.java:		out.print("public class " + parserName + " extends " + SingleDFAEngine.class.getName() + "<" + rootType + "," + errorType + ">\n");
src/edu/umn/cs/melt/copper/compiletime/srcbuilders/single/ParserFragmentEngineBuilder.java:        out.println("public class " + parserName + " extends " + ParserFragmentEngine.class.getName() + "<" + rootType + "," + errorType + "> {");
src/edu/umn/cs/melt/copper/legacy/compiletime/srcbuilders/enginebuilders/single/SingleDFAEngineBuilder.java:		out.print("public class " + parserName + " extends " + SingleDFAEngine.class.getName() + "<" + rootType + "," + errorType + ">\n");
src/edu/umn/cs/melt/copper/legacy/compiletime/srcbuilders/enginebuilders/moded/ModedEngineBuilder.java:		out.print("public class " + parserName + " extends " + ModedEngine.class.getName() + "<" + rootType + "," + errorType + ">\n");
src/edu/umn/cs/melt/copper/legacy/compiletime/srcbuilders/enginebuilders/split/SplitEngineBuilder.java:		out.print("public class " + parserName + " extends " + SplitEngine.class.getName() + "<" + rootType + "," + errorType + ">\n");
src/edu/umn/cs/melt/copper/legacy/compiletime/srcbuilders/enginebuilders/lalr/LALREngineBuilder.java:		out.print("public class " + parserName + " extends " + LALREngine.class.getName() + "\n");

and it looks like this isn't currently possible.

Perhaps an optional <ClassInterfaces> tag with 0 or more <Interface> children? (Inside the <Parser> tag presumably?)
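
On the code-generation side, the change would presumably amount to appending an implements clause to the class header that the builders above print. A hedged sketch, with a hypothetical helper class that is not part of Copper:

```java
import java.util.List;

final class ClassHeaderBuilder {
    /**
     * Builds the class declaration line, adding an implements clause only
     * when at least one interface was specified.
     */
    static String header(String parserName, String superClass, List<String> interfaces) {
        StringBuilder sb = new StringBuilder();
        sb.append("public class ").append(parserName)
          .append(" extends ").append(superClass);
        if (!interfaces.isEmpty()) {
            sb.append(" implements ").append(String.join(", ", interfaces));
        }
        return sb.toString();
    }
}
```

With an empty interface list the output is the same header as today, so existing specs without the new tag would be unaffected.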

MDA for disambiguation classes fails due to reference of extension terminals from host grammar

Disambiguation classes, as implemented currently in Silver, involve creating a disambiguation function applicable to subsets that includes all known host and extension terminals that are members of the class. This disambiguation function is considered part of the "host" since it is declared in a host grammar and might need to disambiguate solely between host terminals. However, it may also contain extension terminals, which is currently disallowed by the MDA rules.
Simply excluding disambiguation functions applicable to subsets from this check could be a solution, but feels like kind of a hack. Similarly, having Silver simply place such generated disambiguation functions in the extension portion of the MDA specification would avoid such errors, but it would complicate the Silver implementation significantly, and seems to side-step the more general issue at play here. I don't really know the best way forward here, @schwerdf any thoughts on this?

Partial Parsing?

Is a partial parse in the case of error currently supported? Eric said he thought that there was some work on that front.