hoaproject / regex

The Hoa\Regex library.

regex's Introduction

Hoa is a modular, extensible and structured set of PHP libraries.
Moreover, Hoa aims at being a bridge between industrial and research worlds.

Hoa\Regex

This library provides tools to analyze regular expressions and generate strings based on regular expressions (Perl Compatible Regular Expressions).

Learn more.

Installation

With Composer, to include this library in your dependencies, require hoa/regex:

$ composer require hoa/regex '~1.0'

For more installation procedures, please read the Source page.

Testing

Before running the test suites, the development dependencies must be installed:

$ composer install

Then, to run all the test suites:

$ vendor/bin/hoa test:run

For more information, please read the contributor guide.

Quick usage

As a quick overview, here are two examples. The first analyzes a regular expression, i.e. lexes, parses and produces an AST. The second generates strings based on a regular expression by visiting its AST with an isotropic random approach.

Analyze regular expressions

We need the Hoa\Compiler library to lex, parse and produce an AST of the following regular expression: ab(c|d){2,4}e?. Thus:

// 1. Read the grammar.
$grammar  = new Hoa\File\Read('hoa://Library/Regex/Grammar.pp');

// 2. Load the compiler.
$compiler = Hoa\Compiler\Llk\Llk::load($grammar);

// 3. Lex, parse and produce the AST.
$ast      = $compiler->parse('ab(c|d){2,4}e?');

// 4. Dump the result.
$dump     = new Hoa\Compiler\Visitor\Dump();
echo $dump->visit($ast);

/**
 * Will output:
 *     >  #expression
 *     >  >  #concatenation
 *     >  >  >  token(literal, a)
 *     >  >  >  token(literal, b)
 *     >  >  >  #quantification
 *     >  >  >  >  #alternation
 *     >  >  >  >  >  token(literal, c)
 *     >  >  >  >  >  token(literal, d)
 *     >  >  >  >  token(n_to_m, {2,4})
 *     >  >  >  #quantification
 *     >  >  >  >  token(literal, e)
 *     >  >  >  >  token(zero_or_one, ?)
 */

We read that the whole expression is a single concatenation of two tokens, a and b, followed by a quantification, followed by another quantification. The first quantification applies to an alternation of (a choice between) two tokens, c and d, repeated 2 to 4 times. The second quantification applies to the e token, which can appear zero or one time.

We can visit the tree with the help of the Hoa\Visitor library.
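To illustrate what visiting such a tree involves, here is a minimal sketch in Python. It does not use Hoa\Visitor or Hoa\Compiler: the tuple-based AST and the dump function are assumptions made for this example, hand-built to mirror the dump shown above.

```python
# Hand-built stand-in for the AST dumped above (not produced by Hoa\Compiler).
# A rule node is (rule_id, [children]); a token is ("token", name, value).
AST = ("#expression", [
    ("#concatenation", [
        ("token", "literal", "a"),
        ("token", "literal", "b"),
        ("#quantification", [
            ("#alternation", [
                ("token", "literal", "c"),
                ("token", "literal", "d"),
            ]),
            ("token", "n_to_m", "{2,4}"),
        ]),
        ("#quantification", [
            ("token", "literal", "e"),
            ("token", "zero_or_one", "?"),
        ]),
    ]),
])


def dump(node, depth=1):
    """Visit the tree depth-first and return one line per node."""
    prefix = ">  " * depth
    if node[0] == "token":
        _, name, value = node
        return [f"{prefix}token({name}, {value})"]
    rule_id, children = node
    lines = [prefix + rule_id]
    for child in children:
        lines.extend(dump(child, depth + 1))
    return lines


print("\n".join(dump(AST)))
```

Any other visitor follows the same shape: one branch per kind of node, with recursion into the children.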

Generate strings based on regular expressions

To generate strings based on the AST of a regular expression, we will use the Hoa\Regex\Visitor\Isotropic visitor:

$generator = new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random());
echo $generator->visit($ast);

/**
 * Could output:
 *     abdcde
 */

Strings are generated at random and match the given regular expression.
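The idea behind this kind of generation can be sketched independently of the library. The Python example below is not the code of Hoa\Regex\Visitor\Isotropic: it assumes a hand-built tuple AST for ab(c|d){2,4}e? and reads "isotropic" as a uniform choice at every choice point (which branch of an alternation, how many repetitions of a quantification).

```python
import random
import re

# Hand-built stand-in for the AST of ab(c|d){2,4}e? (not produced by Hoa).
AST = ("#expression", [("#concatenation", [
    ("token", "literal", "a"),
    ("token", "literal", "b"),
    ("#quantification", [
        ("#alternation", [("token", "literal", "c"),
                          ("token", "literal", "d")]),
        ("token", "n_to_m", "{2,4}"),
    ]),
    ("#quantification", [
        ("token", "literal", "e"),
        ("token", "zero_or_one", "?"),
    ]),
])])


def bounds(quantifier):
    """Return (min, max) repetitions for the two quantifier tokens used here."""
    _, name, value = quantifier
    if name == "zero_or_one":
        return 0, 1
    low, high = value.strip("{}").split(",")  # n_to_m, e.g. "{2,4}"
    return int(low), int(high)


def generate(node, rng):
    """Build one random string by visiting the tree."""
    if node[0] == "token":
        return node[2]  # only literal tokens reach this branch
    rule_id, children = node
    if rule_id == "#alternation":
        return generate(rng.choice(children), rng)
    if rule_id == "#quantification":
        operand, quantifier = children
        low, high = bounds(quantifier)
        count = rng.randint(low, high)  # inclusive on both ends
        return "".join(generate(operand, rng) for _ in range(count))
    # #expression and #concatenation: emit every child in order.
    return "".join(generate(child, rng) for child in children)


rng = random.Random()
samples = [generate(AST, rng) for _ in range(5)]
print(samples)
# Every sample should match the expression it was generated from.
print(all(re.fullmatch(r"ab(c|d){2,4}e?", s) for s in samples))
```

The operand of a quantification is regenerated on each repetition, which is why a result such as abdcde can mix c and d.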

Documentation

The hack book of Hoa\Regex contains detailed information about how to use this library and how it works.

To generate the documentation locally, execute the following commands:

$ composer require --dev hoa/devtools
$ vendor/bin/hoa devtools:documentation --open

More documentation can be found on the project's website: hoa-project.net.

Getting help

There are two main ways to get help:

On the #hoaproject IRC channel,
On the forum at users.hoa-project.net.

Contribution

Do you want to contribute? Thanks! A detailed contributor guide explains everything you need to know.

License

Hoa is under the New BSD License (BSD-3-Clause). Please see LICENSE for details.

regex's People

hywan stephpy shulard viveklucky249 vonglasow stafot unkind gaecom blmage immediate-media digitalkreativ alistair-zhong intracto djidji01 hoa-math-community

regex's Issues

No support for POSIX character classes

Hi,
Parsing [[:alnum:]] results in Unexpected token "[" (class_) at line 1 and column 2.

Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

Implement positive and negative look behind and ahead assertions

negative look behind,
negative look ahead,
positive look behind,
positive look ahead.

Immutable nodes + DSL-specific nodes

Hi,

After trying to reassemble a regex with some modifications to its AST, I suggest adding two things to the TODO list:

Make DSL-specific nodes, e.g. NamedCapturingGroupNode.
Make nodes immutable.

Why?

To change the AST, I would only need to create a new node that takes all node-specific fields in its constructor, which is quite easy.
The API of nodes is cleaner: fewer methods.
You don't have to ask yourself whether to modify an existing node or create a new one. Modifying an existing node is dangerous because you can forget to update something. Creating a new one is hard when Node is just a generic class.

I am comparing my developer experience (DX) with that of PHP-Parser's traverser.

Mutable nodes are also weird from a semantic point of view. A changed regex is a different regex: it is a value object (VO), and a mutable VO makes no sense.
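As a rough illustration of the suggestion, here is a small Python sketch of an immutable, DSL-specific node. The class and its fields are hypothetical and are not part of Hoa\Regex or Hoa\Compiler; the point is only that a change produces a new node while the original stays intact.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class NamedCapturingGroupNode:
    """Hypothetical DSL-specific node; not a class of the Hoa libraries."""
    name: str
    children: tuple  # child nodes, kept in a tuple so they cannot be reassigned


original = NamedCapturingGroupNode(name="year", children=("\\d{4}",))

# "Changing" the AST means building a new node; the original is untouched.
renamed = replace(original, name="yyyy")

try:
    original.name = "oops"
except FrozenInstanceError:
    print("nodes cannot be mutated in place")

print(original.name, renamed.name)
```

All node-specific fields go through the constructor, so there is no separate setter API to keep consistent.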

However, I'd say this library is quite unique for PHP: I didn't find any working alternatives (even considering the fact that this lib doesn't support the full PCRE spec, e.g. (*VERB)). The library is cool anyway.

Incorrect handling of dashes inside character classes

Hi again 😉,

Parsing the pattern [\w-] throws Unexpected token "-" (range) at line 1 and column 4.
When encountered at the last position of a character class, a dash should produce a token with the value "-".

Support internal options setting

See http://pcre.org/pcre.txt.
Quoting:

INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options (which are Perl-compatible) can be changed from
within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are

i for PCRE_CASELESS
m for PCRE_MULTILINE
s for PCRE_DOTALL
x for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also
possible to unset these options by preceding the letter with a hyphen,
and a combined setting and unsetting such as (?im-sx), which sets
PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
PCRE_EXTENDED, is also permitted. If a letter appears both before and
after the hyphen, the option is unset.

The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.

When one of these option changes occurs at top level (that is, not
inside subpattern parentheses), the change applies to the remainder of
the pattern that follows. If the change is placed right at the start of
a pattern, PCRE extracts it into the global options (and it will
therefore show up in data extracted by the pcre_fullinfo() function).

An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the subpattern that follows it,
so

(a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,

(a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.

Note: There are other PCRE-specific options that can be set by the
application when the compiling or matching functions are called. In
some cases the pattern can contain special leading sequences such as
(*CRLF) to override what the application has set or what has been
defaulted. Details are given in the section entitled "Newline
sequences" above. There are also the (*UTF8), (*UTF16), (*UTF32), and
(*UCP) leading sequences that can be used to set UTF and Unicode
property modes; they are equivalent to setting the PCRE_UTF8,
PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP options, respectively. The
(*UTF) sequence is a generic version that can be used with any of the
libraries. However, the application can set the PCRE_NEVER_UTF option,
which locks out the use of the (*UTF) sequences.

Incomplete support for internal option setting

Hi!

What works

Setting a single option: a(?i)b
Unsetting a single option: a(?-i)b

All the above work only for the i, m, s and x options.

What doesn't work:

1. Setting / unsetting the U, X, and J options
2. Setting several options: a(?im)b
3. Unsetting several options: a(?-i-m)b
4. Mixing the above two: a(?i-m)b
5. Setting options for a non-capturing group: a(?i:b)c
6. The grammar allows the (?+i) syntax, but according to the documentation and the PHP implementation this is invalid.

All the above fail with: Unexpected token "?" (zero_or_one) at line 1 and column 3

Possible fixes

Changing the grammar to:

// Internal options.
%token internal_option \(\?(-?[imsxJUX])+\)

solves n° 1, 2, 3, 4 & 6.
n° 5 is a bit more complex... 😉
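The claim can be checked by treating the proposed token pattern as an ordinary regular expression. The sketch below uses Python's re module rather than the Hoa lexer, on the assumption that this pattern only relies on syntax shared by PCRE and Python; the list of cases is taken from the examples in this issue.

```python
import re

# Token pattern proposed in the issue, used here as an ordinary regex.
INTERNAL_OPTION = re.compile(r"\(\?(-?[imsxJUX])+\)")

cases = ["(?i)", "(?-i)", "(?U)", "(?im)", "(?-i-m)", "(?i-m)", "(?+i)", "(?i:b)"]
results = {case: bool(INTERNAL_OPTION.fullmatch(case)) for case in cases}

for case, accepted in results.items():
    print(f"{case:8} {'accepted' if accepted else 'rejected'}")
```

The pattern accepts the single, multiple and mixed option settings, rejects (?+i), and does not cover the (?i:b) group form, which matches the summary above. It only describes the token's shape: it also accepts repeated hyphens such as (?-i-m) without saying anything about their meaning.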

Incorrect handling of left square bracket inside character classes

Hi,

Parsing the pattern [[] throws Unexpected token "[" (class_) at line 1 and column 2.
Left square brackets don't need escaping inside character classes, so they should produce a literal token with the value "[".

Add support for (*VERB)

http://pcre.org/current/doc/html/pcre2pattern.html#SEC27

It seems like there is no support yet for (*MARK:foo), (*FAIL), etc. They are rarely needed, but sometimes they are required.

Simple `(...)` capturing group is not present in AST

$ast = $compiler->parse('~(foo)(bar)~');
$visitor = new Dump();
$visitor->visit($ast);

>  #expression
>  >  #concatenation
>  >  >  token(literal, ~)
>  >  >  #concatenation
>  >  >  >  token(literal, f)
>  >  >  >  token(literal, o)
>  >  >  >  token(literal, o)
>  >  >  #concatenation
>  >  >  >  token(literal, b)
>  >  >  >  token(literal, a)
>  >  >  >  token(literal, r)
>  >  >  token(literal, ~)

I tried to dump the AST with a regular var_dump(), but I see no attribute for an indexed capturing group. #concatenation is something different.

When I walk through the AST, #capturing is never present.

The readme talks about non-existent documentation

The readme says "Different documentations can be found on the website.", but the website does not have any documentation about the Regex package.

Missing brackets in expression for anchor token in Grammar.pp

In the definition of the "anchor" token, the sequence of characters "bBAZzG" should occur within a character class:

diff --git a/Source/Grammar.pp b/Source/Grammar.pp
index 4176085..15e1714 100644
--- a/Source/Grammar.pp
+++ b/Source/Grammar.pp
@@ -106,7 +106,7 @@
 // Please, see PCRESYNTAX(3), General Category properties, PCRE special category
 // properties and script names for \p{} and \P{}.
 %token character_type            \\([CdDhHNRsSvVwWX]|[pP]{[^}]+})
-%token anchor                    \\(bBAZzG)|\^|\$
+%token anchor                    \\([bBAZzG])|\^|\$
 %token match_point_reset         \\K
 %token literal                   \\.|.
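The effect of the missing brackets can be seen by comparing both token patterns as plain regular expressions. The sketch below uses Python's re module instead of the Hoa lexer, on the assumption that these two patterns behave the same way in PCRE and in Python.

```python
import re

OLD_ANCHOR = re.compile(r"\\(bBAZzG)|\^|\$")    # before the patch
NEW_ANCHOR = re.compile(r"\\([bBAZzG])|\^|\$")  # after the patch

for text in [r"\b", r"\A", r"\z", "^", "$", r"\bBAZzG"]:
    old = bool(OLD_ANCHOR.fullmatch(text))
    new = bool(NEW_ANCHOR.fullmatch(text))
    print(f"{text:8} old={old} new={new}")
```

Without the character class, the backslash alternative only matches the literal seven-character sequence \bBAZzG, so \b, \A, \z and the other backslash anchors are never recognised as anchor tokens. With the brackets, each of them matches on its own, while ^ and $ are unaffected.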

String generation does not work as expected with negative character classes

The generation rule for negative character classes does not work as expected: instead of generating a character which does not belong to the character class, it generates a printable ASCII character that is not part of the random characters generated for the class children.

Not quite sure how to fix this though, would you have any pointers as to where to start? Thanks!
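One possible direction, sketched here in Python and not taken from the library, is to expand the class into the full set of characters it can match and then sample from the complement within a chosen alphabet. The member representation (single characters or (first, last) ranges) and the printable-ASCII alphabet are assumptions made for this example.

```python
import random
import string

# Candidate alphabet: printable ASCII, i.e. string.printable without the
# tab/newline-like characters (the space is kept).
ALPHABET = [c for c in string.printable if c.isprintable()]


def expand(members):
    """Expand class members, given as single characters or (first, last) ranges."""
    chars = set()
    for member in members:
        if isinstance(member, tuple):
            first, last = member
            chars.update(chr(code) for code in range(ord(first), ord(last) + 1))
        else:
            chars.add(member)
    return chars


def sample_negated(members, rng):
    """Pick a character that the class itself can never match."""
    excluded = expand(members)
    allowed = [c for c in ALPHABET if c not in excluded]
    if not allowed:
        raise ValueError("the class excludes the whole alphabet")
    return rng.choice(allowed)


# [^a-z0-9_] described by its members.
members = [("a", "z"), ("0", "9"), "_"]
print(sample_negated(members, random.Random()))
```

The difference with the behavior described above is that the exclusion set contains every character of the class, not only the characters that happened to be drawn for its children.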
