regex's Introduction

Hoa



Hoa is a modular, extensible and structured set of PHP libraries.
Moreover, Hoa aims at being a bridge between industrial and research worlds.

Hoa\Regex


This library provides tools to analyze regular expressions and generate strings based on regular expressions (Perl Compatible Regular Expressions).

Learn more.

Installation

With Composer, to include this library into your dependencies, you need to require hoa/regex:

$ composer require hoa/regex '~1.0'

For more installation procedures, please read the Source page.

Testing

Before running the test suites, the development dependencies must be installed:

$ composer install

Then, to run all the test suites:

$ vendor/bin/hoa test:run

For more information, please read the contributor guide.

Quick usage

As a quick overview, here are two examples. The first analyzes a regular expression, i.e. lexes it, parses it and produces an AST. The second generates strings based on a regular expression by visiting its AST with an isotropic random approach.

Analyze regular expressions

We need the Hoa\Compiler library to lex, parse and produce an AST of the following regular expression: ab(c|d){2,4}e?. Thus:

// 1. Read the grammar.
$grammar  = new Hoa\File\Read('hoa://Library/Regex/Grammar.pp');

// 2. Load the compiler.
$compiler = Hoa\Compiler\Llk\Llk::load($grammar);

// 3. Lex, parse and produce the AST.
$ast      = $compiler->parse('ab(c|d){2,4}e?');

// 4. Dump the result.
$dump     = new Hoa\Compiler\Visitor\Dump();
echo $dump->visit($ast);

/**
 * Will output:
 *     >  #expression
 *     >  >  #concatenation
 *     >  >  >  token(literal, a)
 *     >  >  >  token(literal, b)
 *     >  >  >  #quantification
 *     >  >  >  >  #alternation
 *     >  >  >  >  >  token(literal, c)
 *     >  >  >  >  >  token(literal, d)
 *     >  >  >  >  token(n_to_m, {2,4})
 *     >  >  >  #quantification
 *     >  >  >  >  token(literal, e)
 *     >  >  >  >  token(zero_or_one, ?)
 */

We read that the whole expression is composed of a single concatenation of two tokens, a and b, followed by a quantification, followed by another quantification. The first quantification is an alternation of (a choice between) two tokens, c and d, repeated 2 to 4 times. The second quantification is the e token, which can appear zero or one time.

We can visit the tree with the help of the Hoa\Visitor library.
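
To make the idea of visiting the tree concrete, here is a minimal sketch of a depth-first walk that rebuilds the source of the regular expression. It does not use Hoa\Visitor: the node() and token() helpers and the array layout are hypothetical stand-ins that only mirror the dump above, and wrapping every alternation in parentheses is a simplification that happens to fit this example.

```php
<?php
// Hypothetical array-based stand-in for the AST dumped above. This is NOT
// Hoa's node class, only an illustration of a recursive tree walk.
function node(string $id, array $children): array
{
    return ['id' => $id, 'children' => $children];
}

function token(string $name, string $value): array
{
    return ['token' => $name, 'value' => $value];
}

$ast = node('#expression', [
    node('#concatenation', [
        token('literal', 'a'),
        token('literal', 'b'),
        node('#quantification', [
            node('#alternation', [token('literal', 'c'), token('literal', 'd')]),
            token('n_to_m', '{2,4}'),
        ]),
        node('#quantification', [
            token('literal', 'e'),
            token('zero_or_one', '?'),
        ]),
    ]),
]);

// Visit the tree depth-first and rebuild the source of the regular expression.
function rebuild(array $element): string
{
    // Leaves carry the text of their token.
    if (isset($element['token'])) {
        return $element['value'];
    }

    $parts = array_map('rebuild', $element['children']);

    // Simplification for this sketch: an alternation is always grouped.
    if ('#alternation' === $element['id']) {
        return '(' . implode('|', $parts) . ')';
    }

    return implode('', $parts);
}

echo rebuild($ast), "\n"; // ab(c|d){2,4}e?
```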

Generate strings based on regular expressions

To generate strings based on the AST of a regular expression, we will use the Hoa\Regex\Visitor\Isotropic visitor:

$generator = new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random());
echo $generator->visit($ast);

/**
 * Could output:
 *     abdcde
 */

Strings are generated at random and match the given regular expression.
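
One way to check that claim, sketched here with PHP's built-in preg_match() rather than with Hoa, is to test candidate strings against an anchored version of the same regular expression. Only abdcde comes from the example above; the other candidates are illustrative inputs chosen for this sketch.

```php
<?php
// Anchored version of the regular expression used in the examples above.
$pattern = '/^ab(c|d){2,4}e?$/';

// "abdcde" is the sample output quoted above; the others are made up here.
$candidates = ['abdcde', 'abcc', 'abce', 'abcdcdce'];

$results = [];
foreach ($candidates as $candidate) {
    $results[$candidate] = 1 === preg_match($pattern, $candidate);
}

var_export($results);
// 'abdcde'   => true   (three repetitions of c|d, then e)
// 'abcc'     => true   (two repetitions, no e)
// 'abce'     => false  (only one repetition)
// 'abcdcdce' => false  (five repetitions)
```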

Documentation

The hack book of Hoa\Regex contains detailed information about how to use this library and how it works.

To generate the documentation locally, execute the following commands:

$ composer require --dev hoa/devtools
$ vendor/bin/hoa devtools:documentation --open

More documentation can be found on the project's website: hoa-project.net.

Getting help

There are mainly two ways to get help:

Contribution

Do you want to contribute? Thanks! A detailed contributor guide explains everything you need to know.

License

Hoa is under the New BSD License (BSD-3-Clause). Please see LICENSE for details.

regex's People

Contributors

hywan, jubianchi, metalaka, shulard, stephpy, turkanis, vonglasow


regex's Issues

Immutable nodes + DSL-specific nodes

Hi,

After trying to reassemble a regex from a modified AST, I suggest adding two things to the TODO list:

  1. Make DSL-specific nodes, e.g. NamedCapturingGroupNode.
  2. Make nodes immutable.

Why?

  1. To change the AST, I would only need to create a new node that already takes all of its node-specific fields in the constructor, which is quite easy.
  2. The API of nodes is cleaner: fewer methods.
  3. You don't have to ask yourself "modify the existing node or create a new one?". Modifying an existing node is dangerous because you can forget to update something. Creating a new one is hard when Node is just a generic class.

I compare my DX with PHP-Parser's Traverser's.

Mutable nodes are also odd from a semantic point of view. A changed regex is a different regex: it is a value object, and a mutable value object makes no sense.
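
As a rough illustration of the suggestion, an immutable, DSL-specific node could look like the sketch below. Only the class name NamedCapturingGroupNode comes from the list above; its fields, getters and the withName() method are assumptions made for this example and do not exist in the library.

```php
<?php
// Hypothetical DSL-specific node. The class name comes from the suggestion
// above; its fields and methods are assumptions made for this sketch.
final class NamedCapturingGroupNode
{
    private $name;
    private $children;

    // All node-specific fields are provided once, at construction time.
    public function __construct(string $name, array $children)
    {
        $this->name     = $name;
        $this->children = $children;
    }

    public function getName(): string
    {
        return $this->name;
    }

    public function getChildren(): array
    {
        return $this->children;
    }

    // Returns a modified copy; the original node is left untouched.
    public function withName(string $name): self
    {
        return new self($name, $this->children);
    }
}

$original = new NamedCapturingGroupNode('year', ['\d', '{4}']);
$renamed  = $original->withName('y');

echo $original->getName(), ' ', $renamed->getName(), "\n"; // year y
```

There are no setters, so changing the AST always means building a new node, which is the behaviour the issue asks for.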

That said, this library is unique in PHP: I did not find any working alternatives, even considering that it does not support the full PCRE spec (e.g. (*VERB)). The library is cool anyway.

Support internal options setting

See http://pcre.org/pcre.txt.
Quoting:

INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options (which are Perl-compatible) can be changed from
within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are

i for PCRE_CASELESS
m for PCRE_MULTILINE
s for PCRE_DOTALL
x for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also possi-
ble to unset these options by preceding the letter with a hyphen, and a
combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
is also permitted. If a letter appears both before and after the
hyphen, the option is unset.

The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.

When one of these option changes occurs at top level (that is, not
inside subpattern parentheses), the change applies to the remainder of
the pattern that follows. If the change is placed right at the start of
a pattern, PCRE extracts it into the global options (and it will there-
fore show up in data extracted by the pcre_fullinfo() function).

An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the subpattern that follows it,
so

(a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,

(a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.

Note: There are other PCRE-specific options that can be set by the
application when the compiling or matching functions are called. In
some cases the pattern can contain special leading sequences such as
(*CRLF) to override what the application has set or what has been
defaulted. Details are given in the section entitled "Newline
sequences" above. There are also the (*UTF8), (*UTF16), (*UTF32), and
(*UCP) leading sequences that can be used to set UTF and Unicode prop-
erty modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16,
PCRE_UTF32 and the PCRE_UCP options, respectively. The (*UTF) sequence
is a generic version that can be used with any of the libraries. How-
ever, the application can set the PCRE_NEVER_UTF option, which locks
out the use of the (*UTF) sequences.
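
The two examples from the quoted text can be checked with PHP's own preg_match(), as in the sketch below. The matches() helper and the ^...$ anchors are additions made for this example, and the expected results follow the quoted documentation.

```php
<?php
// Helper defined for this sketch only: true when the whole subject matches.
function matches(string $pattern, string $subject): bool
{
    return 1 === preg_match($pattern, $subject);
}

// The option change applies only to what follows it inside the subpattern.
$scoped = '/^(a(?i)b)c$/';
var_dump(matches($scoped, 'abc')); // bool(true)
var_dump(matches($scoped, 'aBc')); // bool(true)
var_dump(matches($scoped, 'abC')); // bool(false): c is outside the subpattern

// The change carries over into the next branch of the same subpattern.
$branch = '/^(a(?i)b|c)$/';
var_dump(matches($branch, 'aB')); // bool(true)
var_dump(matches($branch, 'C'));  // bool(true)
var_dump(matches($branch, 'Ab')); // bool(false): a precedes the option change
```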

Incomplete support for internal option setting

Hi !

What works

  • Setting a single option: a(?i)b
  • Unsetting a single option: a(?-i)b

All the above work only for the i, m, s and x options.

What doesn't work:

  1. Setting / unsetting the U, X, and J options
  2. Setting several options: a(?im)b
  3. Unsetting several options: a(?-i-m)b
  4. Mixing the above two: a(?i-m)b
  5. Setting options for a non-capturing group: a(?i:b)c
  6. The grammar allows the (?+i) syntax, but according to the documentation and the PHP implementation this is invalid.

All the above fail with: Unexpected token "?" (zero_or_one) at line 1 and column 3

Possible fixes

Changing the grammar to:

// Internal options.
%token internal_option \(\?(-?[imsxJUX])+\)

solves n° 1, 2, 3, 4 & 6.
n° 5 is a bit more complex... 😉
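
The proposed token can be sanity-checked in isolation. The sketch below only runs its regular expression through preg_match() with added anchors; it does not go through the Hoa\Compiler lexer, so it says nothing about how the token interacts with the rest of the grammar. The candidate strings are taken from the lists above, plus (?) as an extra empty case.

```php
<?php
// Regular expression of the proposed internal_option token, anchored so that
// the whole candidate must be consumed.
$token = '~^\(\?(-?[imsxJUX])+\)$~';

// Forms from the lists above that the token is expected to accept.
$accepted = ['(?i)', '(?-i)', '(?U)', '(?im)', '(?-i-m)', '(?i-m)'];

// (?+i) is invalid, (?i:b) is case n° 5, (?) is an extra empty case.
$rejected = ['(?+i)', '(?i:b)', '(?)'];

foreach (array_merge($accepted, $rejected) as $candidate) {
    $verdict = 1 === preg_match($token, $candidate) ? 'accepted' : 'rejected';
    echo str_pad($candidate, 10), $verdict, "\n";
}
```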



Simple `(...)` capturing group is not present in AST

$ast     = $compiler->parse('~(foo)(bar)~');
$visitor = new Hoa\Compiler\Visitor\Dump();
echo $visitor->visit($ast);
>  #expression
>  >  #concatenation
>  >  >  token(literal, ~)
>  >  >  #concatenation
>  >  >  >  token(literal, f)
>  >  >  >  token(literal, o)
>  >  >  >  token(literal, o)
>  >  >  #concatenation
>  >  >  >  token(literal, b)
>  >  >  >  token(literal, a)
>  >  >  >  token(literal, r)
>  >  >  token(literal, ~)

I tried to dump the AST with a regular var_dump(), but I see no attribute for an indexed capturing group. #concatenation is something different.

When I walk through the AST, #capturing is never present.
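
For comparison, PHP's own engine numbers these two groups, which is the information an AST would need to carry for the pattern to be reassembled faithfully. This sketch uses only preg_match(); note that PHP treats the ~ characters as delimiters, whereas the dump above shows them parsed as literals.

```php
<?php
// PHP treats ~ as the pattern delimiter, so the pattern itself is (foo)(bar).
$matched = preg_match('~(foo)(bar)~', 'foobar', $groups);

var_dump($matched);  // int(1)
print_r($groups);    // [0] => foobar, [1] => foo, [2] => bar
```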



Missing brackets in expression for anchor token in Grammar.pp

In the definition of the anchor token, the character sequence "bBAZzG" should be placed within a character class:

diff --git a/Source/Grammar.pp b/Source/Grammar.pp
index 4176085..15e1714 100644
--- a/Source/Grammar.pp
+++ b/Source/Grammar.pp
@@ -106,7 +106,7 @@
 // Please, see PCRESYNTAX(3), General Category properties, PCRE special category
 // properties and script names for \p{} and \P{}.
 %token character_type            \\([CdDhHNRsSvVwWX]|[pP]{[^}]+})
-%token anchor                    \\(bBAZzG)|\^|\$
+%token anchor                    \\([bBAZzG])|\^|\$
 %token match_point_reset         \\K
 %token literal                   \\.|.
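
The effect of the missing brackets can be reproduced outside the grammar. The sketch below copies both versions of the token from the diff and tests them with preg_match(); the lexes() helper and the added anchors are assumptions of this example, not part of Hoa\Compiler.

```php
<?php
// Token definitions copied from the diff above. In a single-quoted PHP
// string, '\\\\' denotes the two characters \\ of the pattern.
$before = '\\\\(bBAZzG)|\^|\$';
$after  = '\\\\([bBAZzG])|\^|\$';

// Helper for this sketch: true when the token consumes the whole input.
function lexes(string $token, string $input): bool
{
    return 1 === preg_match('~^(?:' . $token . ')$~', $input);
}

// Without brackets, only the literal sequence \bBAZzG is recognized.
var_dump(lexes($before, '\b'));      // bool(false)
var_dump(lexes($before, '\bBAZzG')); // bool(true)

// With the character class, each single-letter anchor is recognized.
var_dump(lexes($after, '\b'));       // bool(true)
var_dump(lexes($after, '\G'));       // bool(true)
var_dump(lexes($after, '\bBAZzG'));  // bool(false)

// The ^ and $ anchors are unaffected by the change.
var_dump(lexes($before, '^'), lexes($after, '$')); // bool(true) bool(true)
```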
