php-peg's Introduction

PHP PEG - A PEG compiler for parsing text in PHP

This is a Parsing Expression Grammar compiler for PHP. PEG parsers are an alternative to CFG-based parsers: they combine tokenization and parsing in a single top-down grammar. For a basic overview of the subject, see http://en.wikipedia.org/wiki/Parsing_expression_grammar

Quick start

  • Write a parser. A parser is a PHP class with a grammar contained within it in a special syntax. The filetype is .peg.inc. See the examples directory.
  • Compile the parser: php ./cli.php ExampleParser.peg.inc > ExampleParser.php
  • Use the parser (you can also include code to do this in the input parser - again see the examples directory):
$x = new ExampleParser( 'string to parse' ) ;
$res = $x->match_Expr() ;

Parser Format

Parsers are contained within a PHP file, in one or more special comment blocks that start with /*!* [name | !pragma] (like a docblock, but with an exclamation mark in the middle of the stars).

You can have multiple comment blocks, all of which are treated as contiguous for the purpose of compiling. During compilation these blocks will be replaced with a set of "matching" functions (functions which match a string against their rules) for each rule in the block.

The optional name marks the start of a new set of parser rules. This is currently unused, but might be used in future for optimization & debugging purposes. If unspecified, it defaults to the same name as the previous parser comment block, or 'Anonymous Parser' if no name has ever been set.

If the name starts with an '!' symbol, that comment block is a pragma, and is treated not as some part of the parser, but as a special block of meta-data.

Lexically, these blocks are a set of rules & comments. A rule can be a base rule or an extension rule.

Base rules

Base rules consist of a name for the rule, some optional arguments, the matching rule itself, and an optional set of attached functions.

NAME ( "(" ARGUMENT, ... ")" )? ":" MATCHING_RULE
  ATTACHED_FUNCTIONS?

Names may contain only the characters a-z, A-Z, 0-9, _ and -, and must not start with a number.

Base rules can be split over multiple lines as long as subsequent lines are indented.

Extension rules

An extension rule takes one of two forms. The first is the same as a base rule, but with the additional name of the rule to extend. The second, a replacing extension, consists of a name for the rule, the name of the rule to extend, and optionally some arguments, some replacements, and a set of attached functions.

NAME extends BASENAME ( "(" ARGUMENT, ... ")" )? ":" MATCHING_RULE
  ATTACHED_FUNCTIONS?

NAME extends BASENAME ( "(" ARGUMENT, ... ")" )? ( ";" REPLACE "=>" REPLACE_WITH, ... )?
  ATTACHED_FUNCTIONS?

Tricks and traps

We allow indenting a parser block, but only in a consistent manner - whatever the indent of the /*!* marker is becomes the "base" indent, and needs to be used for all lines. You can mix tabs and spaces, but the indent must always be an exact match - if the "base" indent is a tab then two spaces, every line within the block also needs indenting with a tab then two spaces, not two tabs (even if in your editor, that gives the same indent).

Any line with more than the "base" indent is considered a continuation of the previous rule.

Any line with less than the "base" indent is an error.

This might get looser if I get around to re-writing the internal "parser parser" in php-peg, bootstrapping the whole thing.
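
The indent rule above amounts to an exact string-prefix check. The following is a minimal sketch of that idea only, not php-peg's internal code; `classifyLine` is a hypothetical helper, and blank lines are deliberately not handled.

```php
<?php
// Hypothetical helper: classify one line of a parser block relative to the
// "base" indent taken from the /*!* marker line. Not part of php-peg.
function classifyLine(string $base, string $line): string {
    // The base indent must match byte for byte (tab vs. spaces matters).
    if (strncmp($line, $base, strlen($base)) !== 0) {
        return 'error';
    }
    $rest = substr($line, strlen($base));
    // Extra leading whitespace beyond the base indent marks a continuation.
    if ($rest !== '' && ($rest[0] === ' ' || $rest[0] === "\t")) {
        return 'continuation';
    }
    return 'rule';
}

$base = "\t  "; // a tab followed by two spaces
echo classifyLine($base, "\t  foo: bar"), "\n"; // rule
echo classifyLine($base, "\t    | baz"), "\n";  // continuation
echo classifyLine($base, "\t\tfoo: bar"), "\n"; // error: two tabs is not tab + two spaces
```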

Rules

PEG matching rules try to follow standard PEG format, summarised thusly:

token* - Token is optionally repeated
token+ - Token is repeated at least once
token? - Token is optionally present

tokena tokenb - Token tokenb follows tokena, both of which are present
tokena | tokenb - One of tokena or tokenb is present, preferring tokena

&token - Token is present next (but not consumed by parse)
!token - Token is not present next (but not consumed by parse)

( expression ) - Grouping for priority

But with these extensions:

< or > - Optionally match whitespace
[ or ] - Require some whitespace
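
Two of the operators summarised above often surprise people coming from regular expressions: ordered choice commits to the first alternative that matches, and a predicate checks the input without consuming it. The sketch below shows standard PEG semantics with hand-written matchers. All function names (`lit`, `choice`, `seq`, `notAhead`) are hypothetical and are not part of php-peg; how the generated parser implements these internally is not shown here.

```php
<?php
// Each matcher takes (input, position) and returns the new position,
// or null when it fails. Hypothetical helpers, not php-peg API.
function lit(string $s): callable {
    return function (string $in, int $pos) use ($s) {
        return substr($in, $pos, strlen($s)) === $s ? $pos + strlen($s) : null;
    };
}

// tokena | tokenb : try alternatives in order, commit to the first success.
function choice(callable ...$alts): callable {
    return function (string $in, int $pos) use ($alts) {
        foreach ($alts as $alt) {
            $next = $alt($in, $pos);
            if ($next !== null) {
                return $next;
            }
        }
        return null;
    };
}

// tokena tokenb : each part must match where the previous one stopped.
function seq(callable ...$parts): callable {
    return function (string $in, int $pos) use ($parts) {
        foreach ($parts as $part) {
            $pos = $part($in, $pos);
            if ($pos === null) {
                return null;
            }
        }
        return $pos;
    };
}

// !token : succeed, consuming nothing, only if token does NOT match here.
function notAhead(callable $m): callable {
    return function (string $in, int $pos) use ($m) {
        return $m($in, $pos) === null ? $pos : null;
    };
}

$first  = seq(choice(lit('a'), lit('ab')), lit('c')); // ("a" | "ab") "c"
$second = seq(choice(lit('ab'), lit('a')), lit('c')); // ("ab" | "a") "c"
var_dump($first('abc', 0));  // NULL: "a" wins the choice, then "c" fails on "b"
var_dump($second('abc', 0)); // int(3)

$brace = seq(lit('{'), notAhead(lit('{')));           // "{" !"{"
var_dump($brace('{x', 0));   // int(1): the predicate consumed nothing
var_dump($brace('{{', 0));   // NULL
```

Because a choice never revisits a later alternative once an earlier one has matched, put the longer or more specific alternative first.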

Tokens

Tokens may be:

  • bare-words, which are recursive matchers - references to token rules defined elsewhere in the grammar,
  • literals, surrounded by " or ' quote pairs. No escaping support is provided in literals.
  • regexes, surrounded by / pairs.
  • expressions - single words (match \w+) starting with $ or more complex surrounded by ${ } which call a user defined function to perform the match.

Regular expression tokens

Regular expression tokens are automatically anchored to the current position in the string - do not include a string start anchor (^) anywhere. They always behave as if the 'x' flag is enabled in PHP - whitespace is ignored unless escaped, and '#' starts a comment.

Be careful when ending a regular expression token - the '*/' sequence (as in /foo\s*/) will end a PHP comment. Since the 'x' flag is always active, just split it with a space (as in / foo \s* /).
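
To see what anchoring and the 'x' flag mean in practice, here is a small sketch in plain PHP. It uses the `\G` anchor together with the offset argument of `preg_match`. Whether php-peg builds its patterns exactly this way is an assumption, and `matchAt` is a hypothetical helper.

```php
<?php
// Hypothetical helper: match $body at exactly offset $pos with the 'x'
// (extended) flag on, returning the matched text or null.
function matchAt(string $body, string $input, int $pos): ?string {
    // \G anchors the match at the offset passed to preg_match.
    // The newline before ")" keeps a trailing '#' comment in $body
    // from swallowing the closing parenthesis.
    $pattern = '/\G(?:' . $body . "\n)/x";
    if (preg_match($pattern, $input, $m, 0, $pos) === 1) {
        return $m[0];
    }
    return null;
}

// Whitespace inside the pattern is ignored, so ' foo \s* ' means foo\s*
var_dump(matchAt(' foo \s* ', 'xx foo  bar', 3)); // string(5) "foo  "
// Anchored: "foo" occurs later in the input, but not at offset 0.
var_dump(matchAt(' foo \s* ', 'xx foo  bar', 0)); // NULL
// '#' starts a comment under the 'x' flag.
var_dump(matchAt('foo # trailing comment', 'foo', 0)); // string(3) "foo"
```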

Expressions

Expressions allow run-time calculated matching. You can embed an expression within a literal or regex token to match against a calculated value, or simply specify the expression as a token to match against a dynamic rule.

Expression stack

When getting a value to use for an expression, the parser will travel up the stack looking for a set value. The expression stack is a list of all the rules passed through to get to this point. For example, given the parser:

A: $a
B: A
C: B

The expression stack for finding $a will be C, B, A - in other words, the A rule will be checked first, followed by B, followed by C.
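
The lookup order can be sketched as a walk over the stack from the innermost rule outwards. This is a simplified illustration that only covers key / value pairs. The `$stack` layout and the `lookup` function are hypothetical and do not reflect php-peg's internal data structures.

```php
<?php
// Hypothetical sketch: $stack lists the rules passed through, outermost
// first (C, B, A), each with the values set on its result node.
function lookup(array $stack, string $key): ?string {
    // Start at the innermost rule (A) and travel up towards C.
    for ($i = count($stack) - 1; $i >= 0; $i--) {
        if (array_key_exists($key, $stack[$i]['values'])) {
            return $stack[$i]['values'][$key];
        }
    }
    return null; // nothing set anywhere on the stack
}

$stack = [
    ['rule' => 'C', 'values' => ['a' => 'set in C']],
    ['rule' => 'B', 'values' => []],
    ['rule' => 'A', 'values' => []],
];
echo lookup($stack, 'a'), "\n"; // set in C (A and B were checked first)

// A value set closer to the point of use takes priority.
$stack[1]['values']['a'] = 'set in B';
echo lookup($stack, 'a'), "\n"; // set in B
```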

In terminals (literals and regexes)

The token will be replaced by the looked up value. To find the value for the token, the expression stack will be travelled up checking for one of the following:

  • A key / value pair in the result array node
  • A rule-attached method INCLUDING $ ( i.e. function $foo() )

If no value is found it will then check if a method or a property excluding the $ exists on the parser. If neither of those is found the expression will be replaced with an empty string.

As tokens

The token will be looked up to find a value, which must be the name of a matching rule. That rule will then be matched against as if the token was a recurse token for that rule.

To find the name of the rule to match against, the expression stack will be travelled up checking for one of the following:

  • A key / value pair in the result array node
  • A rule-attached method INCLUDING $ ( i.e. function $foo() )

If no value is found it will then check if a method or a property excluding the $ exists on the parser. If neither of those is found the rule will fail to match.

Tricks and traps

Be careful not to use a token expression when you meant to use a terminal expression, for example:

quoted_good: q:/['"]/ string "$q"
quoted_bad:  q:/['"]/ string $q

"$q" matches against the value of q again. $q tries to match against a rule named " or ' (both of which are illegal rule names, and will therefore fail).

Named matching rules

Tokens and groups can be given names by prepending name and :, e.g.,

rulea: "'" name:( tokena tokenb )* "'"

There must be no space between the name and the :

badrule: "'" name : ( tokena tokenb )* "'"

Recursive matchers can be given a name the same as their rule name by prepending with just a :. These next two rules are equivalent:

rulea: tokena tokenb:tokenb
rulea: tokena :tokenb

Rule-attached functions

Each rule can have a set of functions attached to it. These functions can be defined:

  • in-grammar by indenting the function body after the rule
  • in-class after the close of the grammar comment by defining a regular method whose name is {$rulename}_{$functionname}, or {$rulename}{$functionname} if the function name starts with _
  • in a sub class

All functions that are not in-grammar must have PHP compatible names (see PHP name mapping). In-grammar functions will have their names converted if needed.

All these definitions define the same rule-attached function:

<?php
class A extends Parser {
	/*!* Parser
	foo: bar baz
		function bar() {}
	*/

	function foo_bar() {}
}

class B extends A {
	function foo_bar() {}
}
?>

PHP name mapping

Rules in the grammar map to php functions named match_{$rulename}. However rule names can contain characters that php functions can't. These characters are remapped:

'-' => '_'
'$' => 'DLR'
'*' => 'STR'

Other dis-allowed characters are removed.
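
As a rough illustration of the mapping table above, the sketch below applies the three substitutions and then strips whatever is left. `phpName` is a hypothetical helper, not php-peg's own code, and it assumes that only ASCII letters, digits and underscores are kept.

```php
<?php
// Hypothetical helper mirroring the mapping table; not php-peg's code.
function phpName(string $ruleName): string {
    // strtr with an array replaces each character once, without rescanning.
    $mapped = strtr($ruleName, ['-' => '_', '$' => 'DLR', '*' => 'STR']);
    // Assumption: anything that is still not [A-Za-z0-9_] is dropped.
    return preg_replace('/[^A-Za-z0-9_]/', '', $mapped);
}

echo 'match_' . phpName('sub-domain'), "\n"; // match_sub_domain
echo phpName('$foo'), "\n";                  // DLRfoo
echo phpName('*'), "\n";                     // STR
```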

Results

Results are a tree of nested arrays.

Without any specific control, each rule's result will just be the text it matched against in a ['text'] member. This member must always exist.

Marking a subexpression, literal, regex or recursive match with a name (see Named matching rules) will insert a member into the result array named that name. If there is only one match it will be a single result array. If there is more than one match it will be an array of arrays.
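
The arrays below sketch what such results might look like for two made-up rules. Only the members described above are shown; real result nodes may carry additional keys. The rules, the sample input and the `members` helper are all hypothetical.

```php
<?php
// Hypothetical result for:  pair: key:word "=" value:word   matched on "a=b".
// Each name matched once, so each member is a single result array.
$single = [
    'text'  => 'a=b',
    'key'   => ['text' => 'a'],
    'value' => ['text' => 'b'],
];

// Hypothetical result for:  items: (item:word ","?)+   matched on "a,b".
// A name that matched more than once holds an array of result arrays.
$repeated = [
    'text' => 'a,b',
    'item' => [['text' => 'a'], ['text' => 'b']],
];

// Normalise a named member so callers can always loop over it.
function members(array $res, string $name): array {
    if (!isset($res[$name])) {
        return [];
    }
    // A single result array has a 'text' key; a list of results does not.
    return isset($res[$name]['text']) ? [$res[$name]] : $res[$name];
}

foreach (members($repeated, 'item') as $item) {
    echo $item['text'], "\n"; // prints a, then b
}
echo count(members($single, 'key')), "\n"; // 1
```

Since the shape of a member depends on how many times it matched, a normalising helper like this (or a rule-attached storage function) avoids special-casing the single-match case in calling code.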

You can override result storing by specifying a rule-attached function with the given name. It will be called with a reference to the current result array and the sub-match - in this case the default storage action will not occur.

If you specify a rule-attached function for a recursive match, you do not need to name that token at all - it will be called automatically, e.g.

rulea: tokena tokenb
  function tokenb ( &$res, $sub ) { print 'Will be called, even though tokenb is not named or marked with a :' ; }

You can also specify a rule-attached function called *, which will be called with every recursive match made:

rulea: tokena tokenb
  function * ( &$res, $sub ) { print 'Will be called for both tokena and tokenb' ; }

Silent matches

By default all matches are added to the 'text' property of a result. By prepending a member with . that match will not be added to the ['text'] member. This doesn't affect the other result properties that named rules add.

Inheritance

Rules can inherit from other rules using the keyword extends. There are several ways to change the matching of the rule, but they all share a common feature - when building a result set the rule will also check the inherited-from rule's rule-attached functions for storage handlers. This lets you do something like:

A: Foo Bar Baz
  function *(){ /* Generic store handler */ }
  
B extends A
  function Bar(){ /* Custom handling for Bar - Foo and Baz will still fall through to the A#* function defined above */ }

The actual matching rule can be specified in three ways:

Duplication

If you don't specify a new rule or a replacement set the matching rule is copied as is. This is useful when you want to override some storage logic but not the rule itself.

Text replacement

You can replace some parts of the inherited rule using text replacement by using a ';' instead of a ':' after the name of the extended rule. You can then put replacements in a comma separated list. An example might help:

A: Foo | Bar | Baz

# Makes B the equivalent of Foo | Bar | (Baz | Qux)
B extends A; Baz => (Baz | Qux)

Note that the replacements are not quoted. The exception is when you want to replace with the empty string, e.g.

A: Foo | Bar | Baz

# Makes B the equivalent of Foo | Bar
B extends A; | Baz => ""

Currently there is no escaping supported - if you want to replace "," or "=>" characters you'll have to use full replacement.

Full replacement

You can specify an entirely new rule in the same format as a non-inheriting rule, e.g.

A: Foo | Bar | Baz

B extends A: Foo | Bar | (Baz Qux)

This is useful if the rule changes too much for text replacement to be readable, but you want to keep the storage logic.

Pragmas

When opening a parser comment block, if instead of a name (or no name) you put a word starting with '!', that comment block is treated as a pragma - not part of the parser language itself, but some other instruction to the compiler. These pragmas are currently understood:

!silent

  This is a comment that should only appear in the source code. Don't output it in the generated code.

!insert_autogen_warning

  Insert a warning comment into the generated code at this point, warning that the file is autogenerated and not to edit it.

TODO

  • Allow configuration of whitespace - specify what matches, and whether it should be injected into results as-is, collapsed, or not at all
  • Allow inline-ing of rules into other rules for speed
  • More optimisation
  • Make Parser-parser be self-generated, instead of a bad hand rolled parser like it is now.
  • PHP token parser, and other token streams, instead of strings only like now

php-peg's People

Contributors

cognifloyd, peter-kolenic, simonwelsh, uwetews

php-peg's Issues

Predicates work in a very weird manner

Predicates work in a very weird manner. For example this doesn't work as expected:
static_content_symbol: ( / [^\{] / | "{" !"{" | "{" !"*" )+
but this (which is logically equivalent) works fine:
static_content_symbol: ( / [^\{] / | "{" !( "{" | "*" ) )+
and this is not because of several predicates in the same rule, because this one works ok:
comment_content: ( / [^\*\{] / | "*" !"}" | "{" !"*" ) comment_content | comment comment_content | ""
Maybe this is because of the same start "{" of both alternatives in the first (nonworking) example?
For the full grammar and a test sample, see the description of bug #6.

Really important: Packrat sometimes doesn't work!!!

Packrat sometimes doesn't work!!! For example this grammar:
comment_content: ( / [^\*\{] / | "*" !"}" | "{" !"*" ) comment_content | comment comment_content | ""
comment: "{*" comment_content "*}"
static_content_symbol: ( / [^\{] / | "{" !( "{" | "*" ) )+
program: (static_content_symbol | comment) program | ""
leads to an infinite loop on this input:
sjfasldkjf sdfjsdf sdfkljf {* dd {* fdf**{dj}iujljl {*y**} f *} dc{**}od *} aaaaazzz
but it works fine with the ordinary (non-packrat) algorithm!
This may happen because of cyclic definitions, or because of using predicates inside those cycles ... or something else :)

This is really important because without packrat most texts cannot be parsed in a reasonable time!

Names of nonterminals that contain "-" do not work

Names of nonterminals that contain "-" do not work (in spite of their wide usage in examples!). For example this rule will lead to incorrectly generated code: "sub-domain: domain-ref | domain-literal".

Group names inside rules sometimes don't work as expected

Group names inside rules sometimes don't work as expected because they do not share params with other parts of the rule. Example:
unknown_attribute: name:simple_name pws ":" pws value:(expression | identifier)
  function name (&$res, $sub) {
    $res['content'][] = '"';
    $res['content'][] = $sub['text'];
  }
  function value (&$res, $sub) {
    //$res['content'][] = $sub['text']; // <-- this works, but I don't need it
    $res['content'][] = $sub['content']; // <-- this doesn't work because it doesn't know what $sub['content'] is
  }
There is a workaround:
unknown_attribute: name:simple_name pws ":" pws (value:expression | value:identifier)
  function name (&$res, $sub) {
    $res['content'][] = '"';
    $res['content'][] = $sub['text'];
  }
  function value (&$res, $sub) {
    $res['content'][] = $sub['content'];
  }
Pay attention to "value:" name in both examples

Rule comments

The only way I have found to add a comment to a rule is like this:

pws: / [\t\r\n\x20]* /
  # "pws" means "Possible White Space"

(The # must be preceded by two spaces.)

Everything here matters: the comment has to be written exactly on the next row, and with some leading white space. Isn't that quite strange?

Rule names sometimes don't work

This doesn't match:
static: (static_content:( / [^\{] / | "{" !("{" | "*") )+ | comment)+
  function static_content (&$res, $sub) {
    $res['content'][0] = "'";
    $res['content'][1] .= $sub['text'];
  }
but this (which is logically the same) works ok:
static_content: ( / [^\{] / | "{" !("{" | "*") )+
static: (static_content | comment)+
  function static_content (&$res, $sub) {
    $res['content'][0] = "'";
    $res['content'][1] .= $sub['text'];
  }

PHP7 support

Not really an issue, but to get it working with the default configuration right now, you need to disable pcre.jit for cli.php and the compiler:

ini_set("pcre.jit", "0");

It works as intended afterwards.

PHP5 notices

Some code in "Parser.php" leads to several PHP5 notices like "Declaration of ConservativePackrat::packhas() should be compatible with that of Parser::packhas()". It is not an error, but why not fix it?

Regexes can't be mixed with nonterminals

This one will be understood incorrectly by the parser (as a single regex!):

"external_identifier: / [a-zA-Z]+ / | integer_literal"

There is a workaround: wrap it in parentheses like this: ( / [a-zA-Z]+ / | integer_literal ), but I suppose it should work without any bells and whistles.

psr-0 standard compliance

For better integration with other libraries, using the psr-0 standard would be really useful.
It would also simplify development and testing a lot, by giving a more organized directory structure.

I'm currently working on improving the library by adding new features, like arbitrary quantifiers in the PCRE form ({2,4}, {4,}, etc.), case-insensitive literals (globally or case by case), Unicode modifiers, etc.
But having all these classes stuck in one big file is really slowing down the process.

Would you like me to implement this?
Proposed directory layout:

+ hafriedlander
 \-+ Peg
   |-- Compiler.php
   |-+ Compiler
   | |-- Builder.php
   | |-- Flags.php
   | \-- Writer.php
   |-+ Exception
   | |-- GrammarException.php
   | \-- ParseException.php
   |-- Parser.php
   |-+ Parser
   | |-- ConservativePackrat.php
   | |-- FalseOnlyPackrat.php
   | |-- Packrat.php
   | \-- Regex.php
   |-- PendingRule.php
   |-- Rule.php
   |-- RuleSet.php
   |-- Token.php
   \-+ Token
     |-- Expressionable.php
     |-- ExpressionedRecurse.php
     |-- Literal.php
     |-- Option.php
     |-- Recurse.php
     |-- Regex.php
     |-- Sequence.php
     |-- Terminal.php
     \-- Whitespace.php

Secure cli.php to be available from CLI only

php-peg library is included by other projects, for example: https://github.com/maths/moodle-qtype_stack/tree/master/thirdparty/php-peg

In this example, the other library is deployed on a web server. This makes it possible to run any PHP script contained there through a web URL - including https://github.com/maths/moodle-qtype_stack/blob/master/thirdparty/php-peg/cli.php .

If register_argc_argv is set in php.ini, then $_SERVER['argv'] is populated with $_GET so the data could be passed into Compiler::cli( $_SERVER['argv'] ) ; .

I don't think that in the current form cli.php can be exploited in any way but it may be a good idea to protect this script and make sure it only runs as CLI. This could be done with:

if (php_sapi_name() != "cli") {
    die();
}

Class name is too generic.

The class name 'Parser' is far too generic. This is a library rather than a host system so it should avoid generic words.
For example, if I wanted to use this inside of a MediaWiki extension I could not use it as-is, since it would conflict with MediaWiki's Parser class.

Some problems with rules that include variables

There are some problems with rules that include variables. Example:
user_structure: tag_begin pws tag_name:simple_name save_tag_name:"" (rws attribute:user_attribute (rws attribute:user_attribute)*)? (rws tag_stub | (pws tag_end content:template_content? tag_begin pws end_tag_mark pws "$tag_name" pws tag_end))
  function save_tag_name (&$res, $sub) {
    $res['content'][] = '{}';
    $res['content'][] = $res['tag_name']['text'];
  }
  function attribute (&$res, $sub) {
    $res['content'][] = $sub['content'];
  }
  function content (&$res, $sub) {
    $res['content'][] = array ( '"', 'content', $sub['content'] );
  }
Here it is impossible to declare a function "tag_name" to save the tag name, because this name is used as a variable at the end of the rule.
If I declare function tag_name (&$res, $sub) then the variable $tag_name doesn't match!

Another bug appears if I use the variable "name" instead of "tag_name". With "name" it doesn't work at all!
Probably this is because the parser tree in $res already has a member "name" .. but who knows about this?

Bug with pull request #16

One line I forgot to change:

// in Compiler.php on line 691:
$c = substr($sub, 0, 1);

should become

$c = substr($str, $o, 1);

Exceptions if process ends prematurely

There should be some hooks that give the user a chance to throw an exception if parsing ends prematurely, e.g. on a syntax error. At the moment the process just stops and returns an incomplete tree, without any hint about why it is incomplete, which characters stopped the process, or at which position this occurred.

Add a Github Service Hook for Packagist

I just submitted php-peg to Packagist (and added hafriedlander as a maintainer):
https://packagist.org/packages/hafriedlander/php-peg

From the Packagist website:

Enabling the Packagist service hook ensures that your package will always be updated instantly when you push to GitHub. To do so you can go to your GitHub repository, click the "Settings" button, then "Service Hooks". Pick "Packagist" in the list, and add your API token (see above), plus your Packagist username if it is not the same as on GitHub. Check the "Active" box and submit the form.

Grab the API Token from:
https://packagist.org/profile/

Predicates must not call attached functions

Currently, matches within predicates execute attached functions. It's impossible to distinguish whether a call came from a legitimate match or from a predicate.

One possible fix is to pass the stack into the callback.
