
t-regx's Introduction

T-Regx

Latest release: 0.41.5. Dependencies: 0.

T-Regx | Regular Expressions library

PHP regular expressions brought up to modern standards.

See documentation at t-regx.com.

  1. Installation
  2. Examples
  3. Documentation
  4. T-Regx fiddle - Try online
  5. API
    1. For standard projects - pattern()
    2. For legacy projects - preg::match_all()
  6. Overview
    1. Prepared patterns
    2. Working with the developer
    3. Clean API
    4. Fatal errors
    5. Clean Code
    6. Exceptions vs. errors
  7. Comparison
    1. Exceptions over warnings/errors
    2. Working with the developer
    3. Written with clean API in mind
    4. Philosophy of Uncle Bob and "Clean Code"
  8. Plans for the future
  9. Sponsors
  10. License

Installation

Installation for PHP 7.1 and later (PHP 8 as well):

composer require rawr/t-regx

T-Regx only requires the mbstring extension. No other dependencies or extensions are needed.

Examples

Illustration of methods match(), test() and count().

$pattern = Pattern::of("ups"); // pattern("ups") also works
$matcher = $pattern->match('yay, ups');

foreach ($matcher as $detail) {
   $detail->text();    // (string) "ups";
   $detail->offset();  // (int) 5
}

if (!$matcher->test()) {
   echo "No occurrences found";
} else {
   echo "Found {$matcher->count()} occurrences";
}

Documentation

Full API documentation is available at t-regx.com. List of changes is available in ChangeLog.md.

Try it online, in your browser!

Open T-Regx fiddle and start playing around right in your browser. Try now!

API

Choose the interface:

  • I choose the modern regex API:

    Scroll to see - pattern()->test(), pattern()->match(), pattern()->replace()

  • I choose to keep PHP methods (but protected from errors/warnings):

    Scroll to see - preg::match_all(), preg::replace_callback(), preg::split()

For standard projects, we suggest pattern(). For legacy projects, we suggest preg::match_all().

  • Standard T-Regx

    $pattern = Pattern::of("ups"); // pattern("ups") also works
    $matcher = $pattern->match('yay, ups');
    
    if (!$matcher->test()) {
      echo "Unmatched subject :/";
    }
    
    foreach ($matcher as $detail) {
      $detail->text();    // (string) "ups";
      $detail->offset();  // (int) 5
    }
    
    $pattern->replace('well, ups')->with('heck'); // (string) "well, heck"
  • Legacy API

    try {
        preg::match_all('/?ups/', 'ups', $match, PREG_PATTERN_ORDER);
        echo $match[0][0];
    } catch (\TRegx\Exception\MalformedPatternException $exception) {
        echo "Invalid pattern";
    }

Why does T-Regx stand out?

💡 See documentation at t-regx.com

  • Prepared patterns

    User data isn't always safe to use with PCRE (even with preg_quote()), and it isn't convenient either. T-Regx provides a dedicated solution for building patterns with unsafe user input. Choose Pattern::inject() to include user data as literals. Use Pattern::mask() to safely convert user-supplied masks into full-fledged patterns. Use Pattern::template() to construct more complex patterns.

    function makePattern($name): Pattern {
      if ($name === null) {
        return Pattern::of("name[:=]empty");
      }
      return Pattern::inject("name[:=]@;", [$name]); // inject $name as @
    }
    
    $gibberish = "(my?name)";
    $pattern = makePattern($gibberish);
    
    $pattern->test('name=(my?name);'); // (bool) true
  • Working with the developer

    • Simple methods
      • T-Regx exposes its functionality through simple methods that return int, string, string[] or bool, none of which are nullable. If you wish to do something with your match or pattern, there's probably a method that does exactly that and only that.
    • Strings:
    • Groups:
      • When using preg::match_all(), we receive an array of arrays of arrays. In contrast, T-Regx returns an array of groups: Group[]. A Group object contains all the information about the group.

      • Group errors:

        • When an invalid group name is used, for example get('!@#'), T-Regx throws \InvalidArgumentException.
        • When attempting to read a missing group, T-Regx throws NonexistentGroupException.
        • When reading a group that happens not to be matched, T-Regx throws GroupNotMatchedException.
  • Written with clean API

    • Descriptive, simple interface
    • Unicode support out-of-the-box
    • No Reflection used, No (...varargs), No (boolean arguments, true), (No flags, 1), [No [nested, [arrays]]]
    • Inconsistencies between PHP versions are eliminated in T-Regx
  • Protects you from fatal errors

    Certain arguments cause fatal errors with preg_*() functions, which terminate the application and can't be caught. T-Regx predicts whether a given argument would cause a fatal error, and throws a catchable exception instead.

  • T-Regx follows the philosophy of Uncle Bob and "Clean Code"

    A function should do one thing, and it should do it well. A function should do exactly what you expect it to do.

  • Compatible with other tools and libraries

    Granted, Pattern::of() accepts undelimited patterns ((Foo){3,4}), which makes it unsuitable for other PHP libraries that work with delimited patterns (/(Foo){3,4}/), for example Laravel and Routing. In that case, use PcrePattern::of(), which accepts plain-old standard PHP syntax.

  • Exceptions over warnings/errors

    • Unlike PHP methods, T-Regx doesn't use warnings/notices/errors for unexpected inputs:
      try {
        preg::match_all('/([a3]+[a3]+)+3/', 'aaaaaaaaaaaaaaaaaaaa 3');
      } catch (\TRegx\SafeRegex\Exception\CatastrophicBacktrackingException $exception) {
        // caught
      }
    • Detects malformed patterns and throws MalformedPatternException. This is impossible to catch with preg_last_error().
      try {
        preg::match('/?ups/', 'ups');
      } catch (\TRegx\Exception\MalformedPatternException $exception) {
        // caught
      }
    • Not every error in PHP can be read from preg_last_error(); T-Regx, however, throws dedicated exceptions for those events.
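
For comparison, here is a minimal sketch of the bookkeeping that vanilla PHP needs to surface the same backtracking failure. The helper safeMatch() is hypothetical and is not part of T-Regx. The two ini_set() calls are an assumption made only so that the failure is reproducible regardless of the local PCRE configuration.

```php
<?php
// Assumption: lower the limit and turn JIT off, so the failure is reproducible.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '100');

// Hypothetical helper: turns a silent `false` from preg_match() into an exception.
function safeMatch(string $pattern, string $subject): bool
{
    $result = @preg_match($pattern, $subject);
    if ($result === false) {
        // preg_match() only signals failure by returning false,
        // the reason has to be fetched separately.
        throw new RuntimeException('preg_match() failed, preg_last_error() = ' . preg_last_error());
    }
    return $result === 1;
}

try {
    safeMatch('/([a3]+[a3]+)+3/', 'aaaaaaaaaaaaaaaaaaaa 3');
    $outcome = 'no error';
} catch (RuntimeException $exception) {
    $outcome = 'caught';
}
```

Without such a wrapper, the caller would receive false and would have to remember to check preg_last_error() after every call.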

Comparison

[Figure: "Ugly API" compared with "Pretty API"]

Current work in progress

Current development priorities, regarding release of 1.0:

  • Separate SafeRegex and CleanRegex into two packages, so users can choose what they need #103
  • Add documentation to each T-Regx public method #17 [in progress]
  • Revamp of t-regx.com documentation [in progress]
  • Release 1.0

Sponsors

T-Regx is developed thanks to

JetBrains

License

T-Regx is MIT licensed.

t-regx's People

Contributors

bartko-s, danon, danonk, iquito, peter279k

t-regx's Issues

API updates

Add functionality to do this:

pattern('=(\d+?);')->replace($str)->by()->group(1)->callback('dechex');

instead of this

pattern('=(\d+?);')->replace($str)->callback(function (Match $m) {
    return dechex($m->group(1));
});

but I guess it only makes sense with callback() and byMap()

Add `groupBy()` for matching

Allow certain match() methods to return grouped values.

Methods:

  • groupBy(string|int) - groups by capturing group
  • groupByCallback(callable) - groups by user defined callback

Requirements:

  • All of them must be usable with filter()->groupBy().

Example for all()

pattern('\d+(?<unit>cm|mm)')->match('12cm 14mm 13cm 19cm 18mm 2mm')->groupBy('unit')
  ->all();
[
  'cm' => ['12cm', '13cm', '19cm'],
  'mm' => ['14mm', '18mm', '2mm']
]
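
As a rough illustration of the intended result, the grouping can be emulated in vanilla PHP with preg_match_all(). This is only a sketch: groupMatchesBy() is a hypothetical helper, not a T-Regx method, and it assumes that the grouping group is matched in every occurrence.

```php
<?php
// Hypothetical helper: groups the full matches by the text of a capturing group.
function groupMatchesBy(string $pattern, string $subject, string $group): array
{
    // PREG_SET_ORDER yields one array per match, holding all of its groups.
    preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    $grouped = [];
    foreach ($matches as $match) {
        $grouped[$match[$group]][] = $match[0];
    }
    return $grouped;
}

$grouped = groupMatchesBy('/\d+(?<unit>cm|mm)/', '12cm 14mm 13cm 19cm 18mm 2mm', 'unit');
```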

Add `Match.userData()`

<?php
pattern('word')
    ->match('word')
    ->all()
    ->forEach(function (Match $m) {
        $m->setUserData('welp');
    })
    ->filter(function (Match $m) {
        return $m->getUserData() == 'welp' ; // true
    });

or maybe

<?php
$m->userData = 'welp';
return $m->userData == 'welp' ;

PS: Why even bother, when you can duck-type userData?

Handle PHP bug #77827 using PCRE Prepared Patterns

If flag x is used in a pattern, then all whitespace in the pattern is ignored.

If I use it in PatternBuilder:

Pattern::inject('/my duper pattern @input/x', [
    'input' => 'text with whitespaces'
]);

then the whitespace in the user input will also be ignored, because of the x flag.
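
The same effect can be reproduced in vanilla PHP, because preg_quote() does not escape spaces. The snippet below is a small demonstration using only built-in functions; the variable names are illustrative.

```php
<?php
$input = 'text with whitespaces';

// preg_quote() escapes regex metacharacters, but leaves spaces untouched.
$quoted = preg_quote($input, '/');

// Without the x flag, the spaces in the pattern are literal.
$withoutX = preg_match('/^' . $quoted . '$/', $input);

// With the x flag, the spaces are dropped from the pattern,
// so the very input that was quoted no longer matches.
$withX = preg_match('/^' . $quoted . '$/x', $input);

// What the x-flagged pattern matches instead.
$collapsed = preg_match('/^' . $quoted . '$/x', 'textwithwhitespaces');
```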

Add group replacement injecting

This feature might feel superfluous to some, because the result can be achieved with lookaheads and lookbehinds.

When replacing http://(\w+).com for http://google.com you could inject group 1 with 'xd', which would result in 'http://xd.com'.

Add `Pattern.match()` methods

Instead of returning ->all(), return an \Iterator to traverse elements.

Also for Pattern.match().group().iterator().

match()->first()

Example: (list of people, some with ages)

Clark, Brian 23, Olivia, Jane 24, Marcin 26

New API:

  • Returns a group from the first match or throws exception
    pattern('[a-z]+(?: (?<age>\d+))?')->match($s)->group('age')->first();
    Throw GroupNotMatchedException if subject matched, but group wasn't.

Replace `or`

Add API for handling replacements, that didn't happen.

The default behaviour:

  • Returning the input $subject - just like any replacement

The customizable behaviour:

  • Returning a predefined string, if no replacement happened - orReturn()
  • Returning generated string based on $subject - orElse()
    • Just returning $string is no different than default behaviour
  • Throwing an exception (ReplacementPatternNotMatchedException) - orThrow()

API:

Replace \d+ with '0' or return Unmatched: $subject:

pattern('\d+')->replace('Hello')
  ->first()
  ->orElse(function (string $subject) {
    return "Unmatched: '$subject'" ;
   })
  ->with('0');
Unmatched: 'Hello'

Replace \d+ with '0' or return Unmatched:

first()->orReturn('Unmatched :/')->with('0');
Unmatched :/

Replace \d+ with '0' or throw InvalidArgumentException/default:

first()->orThrow(InvalidArgumentException::class)->with('0');
// or
first()->orThrow()->with('0');

Exception is thrown
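
For reference, the orReturn() behaviour can be approximated in vanilla PHP with the $count argument of preg_replace(). This is a sketch under assumptions: replaceFirstOrReturn() is a hypothetical helper and does not reflect how T-Regx implements the feature.

```php
<?php
// Hypothetical helper: replaces the first occurrence, or returns $default
// when the pattern did not match at all.
function replaceFirstOrReturn(string $pattern, string $replacement, string $subject, string $default): string
{
    // The fifth argument receives the number of replacements performed.
    $result = preg_replace($pattern, $replacement, $subject, 1, $count);
    if ($result === null) {
        throw new RuntimeException('preg_replace() failed, preg_last_error() = ' . preg_last_error());
    }
    return $count === 0 ? $default : $result;
}

$unmatched = replaceFirstOrReturn('/\d+/', '0', 'Hello', 'Unmatched :/');
$replaced  = replaceFirstOrReturn('/\d+/', '0', 'a 12 b 34', 'Unmatched :/');
```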

Pattern.simplify()

Make a method that knows how to simplify a pattern.

pattern('[0-9]{1,} and [.]{1}')->analyze()->simplify();
\d+ and \.

WITH SPECIAL CARE ABOUT ESCAPING

Fix automatic delimiters detection

Currently, pattern()/Pattern::of()/new Pattern(), each "guess" whether a pattern is delimited (with flags) or not.

This is ambiguous, for example:

  • is /abc/m delimited? Of course
  • is _abc_m delimited? pattern() would say yes
  • is ,(word),x delimited? pattern() would see ,(word), and x flag.
  • is ,(word),xF delimited? suddenly, xF are not valid flags so pattern() would see ,(word),xF with no flags.
  • is //s an empty pattern with s flag? Or //s pattern? pattern() would choose the first

So:

  • we must remove automatic delimiter detection, because it causes too much ambiguity
  • pattern()/Pattern::of() will now treat each pattern as undelimited by default. Validation will be performed, whether we have enough delimiters to delimit it properly. If we don't, the other method will have to be used.
  • New method will have to be added Pattern::pcre(), which will only accept a delimited pattern, that is:
    • Having delimiters at the start and end of a string (+ additional flags).
    • No validation of pattern used with the new method should be done, in hope that PCRE by itself will validate it.
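
One way to picture the "enough delimiters" check is to wrap an undelimited pattern in the first delimiter that does not occur in it. The sketch below is only an illustration: delimit() and its candidate list are hypothetical and are not taken from the T-Regx source.

```php
<?php
// Hypothetical helper: wraps an undelimited pattern in the first candidate
// delimiter that does not occur in the pattern, so nothing needs escaping.
function delimit(string $pattern, string $flags = ''): string
{
    foreach (['/', '#', '%', '~', '+', '!'] as $delimiter) {
        if (strpos($pattern, $delimiter) === false) {
            return $delimiter . $pattern . $delimiter . $flags;
        }
    }
    // Every candidate occurs in the pattern: the caller has to
    // supply a delimited pattern explicitly.
    throw new InvalidArgumentException("No free delimiter for pattern: $pattern");
}

$plain   = delimit('(Foo){3,4}');
$slashed = delimit('a/b', 'i');
$matched = preg_match($slashed, 'xA/By');
```

With this approach there is no guessing: every input is treated as undelimited, and a delimited pattern has to go through a separate entry point.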

New method name, perhaps:

  • Pattern::standard() - standard pattern for PCRE, not so much for T-Regx
  • Pattern::basic()
  • Pattern::pcre() - unintuitive
  • Pattern::undelimited() - too long
  • Pattern::raw() - too crude

Simplicity

MatchesPattern and Base have very similar constructors and matches() methods, but they ought to do completely different things.

substr()

For a pattern like this:

pattern('_ERROR$')->matches($subject);

to filter those values

NO_ERROR
UTF8_ERROR
RUNTIME_ERROR

CleanRegex could analyze the pattern, see if it doesn't contain any special characters (or all those characters are escaped) and use substr() and strpos() instead. If the pattern happens to have ^ or $ at the beginning or end, it would only match those substrings at the beginning or end.
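
A minimal sketch of that fast path for the '_ERROR$' case, in vanilla PHP. The helper endsWith() is hypothetical. Note one assumption: the regex used for comparison ends with \z rather than $, because $ in PCRE also matches before a trailing newline, which a plain substr() check does not.

```php
<?php
// Hypothetical fast path for a literal that is anchored at the end of the subject.
function endsWith(string $subject, string $suffix): bool
{
    if ($suffix === '') {
        return true;
    }
    return substr($subject, -strlen($suffix)) === $suffix;
}

$names = ['NO_ERROR', 'UTF8_ERROR', 'RUNTIME_ERROR', 'ERROR_CODE'];

// Filtering with the string function...
$viaSubstr = array_values(array_filter($names, function (string $name): bool {
    return endsWith($name, '_ERROR');
}));

// ...and with the equivalent regular expression.
$viaRegex = array_values(preg_grep('/_ERROR\z/', $names));
```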

Fix Match.groups()

  • $match->groupOffsets() - to remove
  • $match->group('as')->offset() - ok
  • $match->group('as')->byteOffset() - todo
  • $match->groups()->offsets() - ok
  • $match->groups()->byteOffsets() - todo

Pattern.analyze().validate()

pattern()->valid() returns true/false, indicating whether a pattern is valid or not.

We also need pattern()->analyze()->validate() that will (or will not) throw an Exception with detailed message about the pattern and why it's invalid.

Possibly also a method pattern()->analyze()->validateWithSubject($s) that could check not only compile-time but also run-time errors (like utf8 errors or backtracking errors).

QuotePattern

Add a way to use user-supplied values inside patterns.

 return preg_match(
  '/\b' . preg_quote($str1, '/') . '\b with ' . preg_quote($str2, '/') . ' :\)/i', 
$str2);

It should:

  • be similar to prepared sql statements, but without presenting tokens in patterns (regexp are already complicated enough)
  • be simple to quote user input (not to remember the delimiter, not to use long method)
  • use multiple combinations of patterns and user input

Facade style API change proposal

For Facade style, instead of:

Pattern::pattern('[A-Z][a-z]+')->test($subject)

You could try:

Pattern::of('[A-Z][a-z]+')->test($subject)

Reads better for me :)

Allow for filtering replacements

ReplaceMatch already receives an analyzed pattern. This pattern could be used to iterate Matches, filter some of them out, and only perform the replacement on the remaining Matches.

  • Add ignoring() method which ignores replacements by polymorphism
  • Include user data between ignoring() and callback()
  • Disallow ignoring()->ignoring()->callback()
  • Add replace()->nth(i)
  • Add replace()->range()
  • Add replace()->rangeIfExists()

Add `match()->unique()`

Requirements:

  • Must be usable with unique()->filter() and filter()->unique()
  • Must be usable with unique()->groupBy()

Methods:

  • unique() identifies uniqueness by Match.text()
  • unique()->byGroup($p) identifies uniqueness by Match.group($p)
  • unique()->byCallback(callable) identifies uniqueness by callable which receives Match object

Makes sense for:

  • unique()->all()
  • unique()->only(int) - which retrieves a certain number of matches, and then de-duplicates them :D
    • So ['1', '1', '2', '3'] for unique()->only(3) would return:
      • ['1', '2'].
    • So ['1', '1', '2', '3'] for unique()->maxUnique(3)->only(4) would return:
      • ['1', '2', '3'].
  • unique()->count()
  • unique()->map(), flatMap()
  • unique()->forEach(), iterate(), iterator()
  • unique()->offsets()/byteOffsets(),
  • unique()->group('unit')->all() - finds unique matches, and returns unit groups from them
  • unique()->group('unit')->offsets()/byteOffsets(),

It should not be possible to call:
group('unit')->unique()
because what sense does it make to:
group(1)->unique()->byGroup(2)->group(3)
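
The order of operations in the unique()->only(3) example above matters, which a few lines of vanilla PHP can show. This is only an array-level sketch of the two possible orders, not of the proposed API.

```php
<?php
$texts = ['1', '1', '2', '3'];

// Limit first, then remove duplicates: the order described for unique()->only(3).
$limitThenUnique = array_values(array_unique(array_slice($texts, 0, 3)));

// Remove duplicates first, then limit: shown only for contrast.
$uniqueThenLimit = array_slice(array_values(array_unique($texts)), 0, 3);
```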

Incoming features

  • Huge Feature pattern()->explain() - returns nested arrays of human readable description of pattern for debugging.

Use git tags to release versions

Hi, I just found out about this library, it looks really good, but I have a question. Is there a reason why you aren't locking releases with git tags?

Fluent map()/filter()

pattern($p)->fluent()
  ->map($a -> $a)
  ->filter($a -> true)
  ->map($a -> $again)
  ->all(); // ->first() // ->forFirst() // ->only(10);

ReplacePattern update

Many changes have been made to MatchPattern and Match. Those changes must also be applied to ReplacePattern and ReplaceMatch.

Add `filter()` to match().group()

It should be possible to call

pattern()->match()->group()->map(); // done
pattern()->match()->group()->forEach(); // done
pattern()->match()->group()->fluent(); // done
pattern()->match()->group()->filter();
pattern()->match()->group()->forFirst(); // done
pattern()->match()->group()->offsets(); // done
  • TODO: See if the result of Match.group() has a similar interface with a possible callback argument.

Level Up `Pattern.filter()`

What could be possible

pattern()->filter()->all();      // return all filtered values
pattern()->filter()->first();    // return the first of filtered values
pattern()->filter()->only(int);  // return limited filtered values

But that's problematic because elements of $a can be matched more than once :/

Common method `only()` for `match()`

Given that replace() implements common interface by limits

pattern()->replace()->first()    // limit 1
pattern()->replace()->all()      // limit -1
pattern()->replace()->only(int)  // limit $x 

and match() by preg methods

pattern()->match()->first()      // preg_match()
pattern()->match()->all()        // preg_match_all()
pattern()->match()->only(int)    // array_slice(preg_match_all(), 0, $i)

a question arises: Should there also be only() method?

pattern()->match()->only(int)

Return only offsets from `->match()`

This is a substitute for

// all
pattern()->match()->map(function ($m) {
    return $m->offset();
});

// first
pattern()->match()->first(function ($m) {
    return $m->offset();
});

For groups you could do

pattern()->match()->map(function ($m) {
    return $m->group('a')->offset(); // this already returns group offset
});
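
For comparison, vanilla PHP exposes offsets through the PREG_OFFSET_CAPTURE flag. The sketch below uses a hypothetical helper, matchOffsets(), and returns byte offsets, which differ from character offsets for multibyte subjects.

```php
<?php
// Hypothetical helper: returns the byte offset of every match.
function matchOffsets(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
    // With PREG_OFFSET_CAPTURE, each entry is a pair: [text, byteOffset].
    return array_column($matches[0], 1);
}

$offsets = matchOffsets('/\d+/', 'a 12 b 345');
```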

Add PHP Doc to every public method of T-Regx API

  • Include a detailed documentation of every public method
  • Include only @param, @returns and @throws tags.
  • Include @psalm and @phpstan annotations
  • Include @see and @link attributes linking to classes and documentation on t-regx
  • Add @api tag to public elements and @internal tag to implementation details (most of Internal/ namespace)

Add missing tests

  • For match()->only($i) where $i is bigger than the number of matches
  • For pattern()->match() with subject longer than 65536 characters.

Update exception message

pattern()->match()->offsets()->first()

throws
SubjectNotMatchedException : Expected to get first match, but subject was not matched

but could throw

SubjectNotMatchedException : Expected to get first match offset, but subject was not matched

Add `asInt()` to MatchPattern, GroupLimitPattern and Match

Add asInt() so that one can call

  •  pattern()->match()->asInt();
     pattern()->match()->asInt();
     pattern()->match()->asInt();
  • asInt() should be identical to:
    pattern()->match()
      ->fluent()
      ->map(function (Match $match) {
        return $match->toInt();
      })
      ->*();
    but faster, because Match won't have to be constructed
  • Also add for match()->group()

Returning `Match`

Allow Match and Match.group() to be returned from any callback where it makes sense (and treat them as Match.text() and GroupMatch.text()). Use it for methods:

  • replace()->callback()
  • match()->map()
  • replace()->by()->group()->callback()

Helper static method for group name format validator

GroupName::isValid(2); // true
GroupName::isValid('group'); // true

GroupName::isValid('2asd'); // false
GroupName::isValid(-1); // false
GroupName::isValid(null); // false
GroupName::isValid([]); // false

And same for MatchGroup::isValidName() that only works for strings
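
A sketch of what such a validator could look like, written as a plain function. The name isValidGroupName() is hypothetical, and the rules are an assumption based on PCRE's naming rules for groups: a non-negative integer index, or a name of up to 32 letters, digits and underscores that does not start with a digit.

```php
<?php
// Hypothetical validator for a group identifier (index or name).
function isValidGroupName($nameOrIndex): bool
{
    if (is_int($nameOrIndex)) {
        // Group 0 is the whole match, so only negative indexes are rejected.
        return $nameOrIndex >= 0;
    }
    if (is_string($nameOrIndex)) {
        // \z is used instead of $, so a trailing newline is rejected as well.
        return preg_match('/^[a-zA-Z_][a-zA-Z0-9_]{0,31}\z/', $nameOrIndex) === 1;
    }
    return false;
}
```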

CompositePattern

A common use case for preg_replace() is a chain of calls:

$r = preg_replace("/(!|\"|#|$|%|'|̣)/", '', $r);
$r = preg_replace("/(̀|́|̉|$|>)/", '', $r);
$r = preg_replace("'<[\/\!]*?[^<>]*?>'si", '', $r);
return $r;
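
For context, preg_replace() itself already accepts an array of patterns, which it applies one after another. The sketch below uses simplified, illustrative patterns rather than the ones above, and shows that the array form gives the same result as the chain.

```php
<?php
$subject = '<b>Hi!</b> #1';

// Chained calls: each call works on the output of the previous one.
$chained = preg_replace('/<[^<>]*>/', '', $subject);
$chained = preg_replace('/[!"#%]/', '', $chained);

// Array form: the patterns are applied in order, in a single call.
$composite = preg_replace(['/<[^<>]*>/', '/[!"#%]/'], '', $subject);
```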

Update ReadMe

  • Add info about calling preg_last_error() to readme
  • Add info about "always an exception" to table of contents

PS:

  • Add general PatternException (or maybe TRegxException) that all methods will call.

Compatibility break between PHP 7.1.12 and 7.1.13

Test ErrorsCleanerTest.shouldGetCompileError() passes for every version of PHP between 7.1.0-7.1.12, and keeps failing between 7.1.13-7.1.21 and for every 7.2.* version.

This is caused by this bug fix: https://bugs.php.net/bug.php?id=74183

For <=7.1.12, a compile-time PCRE error only makes error_get_last() return an error. preg_last_error() does not report anything (it returns 0, PREG_NO_ERROR). The data from error_get_last() is enough to throw a CompileError-based exception.

Since 7.1.13, though, preg_last_error() also returns data that warrants an exception, so CompileError is no longer raised; the composite model BothHostError is raised instead.
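
The two error channels described above can be inspected side by side in vanilla PHP. This is a small sketch using only built-in functions; which value preg_last_error() reports depends on the PHP version, as explained above.

```php
<?php
error_clear_last();

// Malformed pattern: a quantifier with nothing to repeat.
$result = @preg_match('/?ups/', 'ups');

// Channel 1: the suppressed warning is still recorded here.
$lastError = error_get_last();

// Channel 2: PREG_NO_ERROR on older versions, PREG_INTERNAL_ERROR on newer ones.
$pregError = preg_last_error();
```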

Solutions:

  • first: Write two separate tests for <=7.1.12 and >=7.1.13 (meh)
  • second: Make tests only rely on interfaces and return values, not on specific instances (better)

Problems:

  • first: Adds unnecessary complexity.
  • second: If the tests only relied on interfaces, this bug wouldn't have ever been found. (no longer a unit test, but an integration test. To unit test it, we need to know the exact implementation).
