
tregexpr's Introduction

Test License: MIT

TRegExpr: library for regular expressions

A regular expression engine in pure Object Pascal, suitable for Delphi and Free Pascal. It is now part of the Free Pascal project.

Documentation in English, Russian, German, Bulgarian, French and Spanish

regex.sorokin.engineer

Free Pascal fork

Git repository

tregexpr's People

Contributors

alexey-t, andgineer, comradekingu, fau, ilocit, leela52452, lisapple, nathanbnm, pierreyager, slmaker


tregexpr's Issues

Docs

ww

  • Why is there a second example for \tfoobar here? There is already one for \t.
  • It would be better to put all the info about \ci on one line instead of three.

Better remove macOS from CI

macOS tests run SLOWLY: first the VM installs 100*k packages, then it installs the 100M Lazarus package. It has been running for 9 minutes already, while all Linux tests passed long ago.

Suggest: Fixes to get working on newer versions of Delphi

  • The new solution for regular expressions in Delphi has some serious drawbacks as well, in particular in e.g. XE2-XE4 when dealing with UCS-2 documents, and it is not maintained by a third party, so people cannot upgrade that lib without buying new Delphi versions.
  • I want to have cross-compilable code for Delphi and Lazarus. I think this is an important goal and very easy to reach.

I suggest adding these changes to the RegExpr unit:

{$IFDEF VER170} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi 2005
{$IFDEF VER180} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi 2006/2007
{$IFDEF VER200} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi 2009
{$IFDEF VER210} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi 2010
{$IFDEF VER220} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE
{$IFDEF VER230} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE2
{$IFDEF VER240} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE3
{$IFDEF VER250} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE4
{$IFDEF VER260} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE5
{$IFDEF VER270} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE6
{$IFDEF VER280} {$DEFINE UniCode} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // Delphi XE7

and maybe

{$IFDEF D7} {$DEFINE UseAsserts} {$ENDIF}

Extra:
I would also suggest increasing this to 63 instead of 15 (you hit 15 really fast; in practice 63 holds even when used intensively over 10+ years)

NSUBEXP = 63;

Exec with TryOnce

I will add a TryOnce exec: an exec which tests only at Offset (not in a loop). It's needed for a lexer/parser which must test only one offset; the outer code must then change the offset (sometimes by 1, sometimes by n).

To add ATryOnce, I will change the code near if reganchored <> #0 in MatchPrim.

Is the naming OK?
ExecPos(AOffset: integer; ATryOnce: boolean)

Directory is missing in restudio

I can't compile restudio due to a missing directory.

The compiler complains about the missing unit 'tynList', which is expected in the directory 'Persistence'.
Unfortunately this directory does not exist.

Can you please provide the directory (with tynlist, ansoStrings & ansoRTTIHook)?

Kind regards
Andreas

todo optimization

          if (PREOp(scan)^ = OP_EXACTLY) and
            (strlen(scan + REOpSz + RENextOffSz) >= PtrInt(Len)) then
          begin
            longest := scan + REOpSz + RENextOffSz;
            Len := strlen(longest);
          end;

strange IFDEF

{$IFDEF OverMeth}
function TRegExpr.Replace(const AInputStr: RegExprString;
  AReplaceFunc: TRegExprReplaceFunction): RegExprString;
begin
  {$IFDEF FPC}Result := {$ENDIF}
  ReplaceEx(AInputStr, AReplaceFunc);
end; { of function TRegExpr.Replace
  -------------------------------------------------------------- }
{$ENDIF}

Why is Result set only for FPC?

Give error for [\1\Z]

If meta chars are not allowed in [], this case is not handled and no error is shown.
E.g. try the regex [\1] or [\Z]: this gives the char '1' in [] and the char 'Z' in [].
Suggestion: show an error here.

            if regparse^ = EscChar then
            begin
              Inc(regparse);
              if regparse >= fRegexEnd then
              begin
                Error(reeParseAtomTrailingBackSlash);
                Exit;
              end;
              if _IsMetaChar(regparse^) then
              begin
                AddrOfString := nil;
                CanBeRange := False;
                EmitC(OpKind_MetaClass);
                EmitC(regparse^);
              end
              else
              begin
                EmitSimpleRangeC(UnQuoteChar(regparse));
               //!! error
              end;

Version prop is not needed

v. 0.947 2001.10.03
-=- (+) VersionMajor/Minor class method of TRegExpr ;)

After #41 is applied, I suggest removing these two properties. What is their purpose? Only to show something in the REStudio window? What matters to a programmer is the code, and the version is already in history.txt. Even REStudio can display the version without this property.

CI for Unicode mode needed

See the title.
For this the test needs a define. FPC has the -d parameter, e.g. -dUnicode, where the word after -d is the symbol to define.
Please add it.

After #74 you will see a test with the function TestUnicode1, which is defined inside ifdef Unicode.

Support \h and \v

Maybe I will implement it.
https://www.regular-expressions.info/shorthand.html

While support for \d, \s, and \w is quite universal, there are some regex flavors that support additional shorthand character classes. Perl 5.10 introduced \h and \v. \h matches horizontal whitespace, which includes the tab and all characters in the "space separator" Unicode category. It is the same as [\t\p{Zs}]. \v matches "vertical whitespace", which includes all characters treated as line breaks in the Unicode standard. It is the same as [\n\cK\f\r\x85\x{2028}\x{2029}].

PCRE also supports \h and \v starting with version 7.2. PHP does as of version 5.2.2, Java as of version 8, and the JGsoft engine as of version 2. Boost supports \h starting with version 1.42. No version of Boost supports \v as a shorthand.

In many other regex flavors, \v matches only the vertical tab character. Perl, PCRE, and PHP never supported this, so they were free to give \v a different meaning. Java 4 to 7 and JGsoft V1 did use \v to match only the vertical tab. Java 8 and JGsoft V2 changed the meaning of this token anyway. The vertical tab is also a vertical whitespace character. To avoid confusion, the above paragraph uses \cK to represent the vertical tab.
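The two classes quoted above can be sketched as plain character tests. This is a minimal illustration, not TRegExpr code: the helper names IsHorzSpace and IsVertSpace are hypothetical, and the list of "space separator" (Zs) code points is written out by hand from recent Unicode versions, so it is an assumption that may need updating.

```pascal
program HorzVertSpace;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

// Hypothetical helper for \h: tab plus the Unicode Zs category,
// i.e. [\t\p{Zs}]. The Zs list is hand-written (assumption).
function IsHorzSpace(Ch: WideChar): Boolean;
begin
  case Ord(Ch) of
    $0009, $0020, $00A0, $1680, $2000..$200A, $202F, $205F, $3000:
      Result := True;
  else
    Result := False;
  end;
end;

// Hypothetical helper for \v: [\n\cK\f\r\x85\x{2028}\x{2029}].
function IsVertSpace(Ch: WideChar): Boolean;
begin
  case Ord(Ch) of
    $000A, $000B, $000C, $000D, $0085, $2028, $2029:
      Result := True;
  else
    Result := False;
  end;
end;

begin
  WriteLn(IsHorzSpace(#9));   // tab is horizontal whitespace
  WriteLn(IsHorzSpace(#10));  // line feed is not
  WriteLn(IsVertSpace(#10));  // line feed is vertical whitespace
  WriteLn(IsVertSpace(' '));  // space is not
end.
```

A case statement keeps both checks branch-only, without any string scan, which is presumably what an opcode for \h or \v would want.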

Put FPC 3.0.4 to Travis on this Github

The FPC you use is too old:

Free Pascal Compiler version 2.6.2-8 [2014/01/22] for x86_64

Copyright (c) 1993-2012 by Florian Klaempfl and others

Target OS: Linux for x86-64

Compiling testregexpr.pp

Compiling tcregexp.pas

tcregexp.pas(20,3) Fatal: Can't find unit fpwidestring used by tcregexp

ERROR: failed compiling of project /home/travis/build/andgineer/TRegExpr/test/testregexpr.lpi

Better find in char classes

After patch #90 I will also do the following:

  • write the names of meta classes into the opcode, instead of whole strings like DigitChars, WordChars, SpaceChars
  • make the inverse classes \W \S \D work inside []
  • write ranges like a-x as two codes instead of the whole string from start to end

Help on code needed

I need a hint. I can't figure out the code: I need to change StrScan and StrScanCI, but they are called in two places. Which one should I change to alter how the [ ] char class is handled?

The first is in regrepeat:

    OP_ANYOF:
      while (Result < TheMax) and (StrScan(opnd, scan^) <> nil) do
      begin
        Inc(Result);
        Inc(scan);
      end;
    OP_ANYBUT:
      while (Result < TheMax) and (StrScan(opnd, scan^) = nil) do
      begin
        Inc(Result);
        Inc(scan);
      end;
    OP_ANYOFCI:
      while (Result < TheMax) and (StrScanCI(opnd, scan^) <> nil) do
      begin
        Inc(Result);
        Inc(scan);
      end;
    OP_ANYBUTCI:
      while (Result < TheMax) and (StrScanCI(opnd, scan^) = nil) do
      begin
        Inc(Result);
        Inc(scan);
      end;

The second is in MatchPrim:

      OP_ANYOF:
        begin
          if (reginput = fInputEnd) or
            (StrScan(scan + REOpSz + RENextOffSz, reginput^) = nil) then
            Exit;
          Inc(reginput);
        end;
      OP_ANYBUT:
        begin
          if (reginput = fInputEnd) or
            (StrScan(scan + REOpSz + RENextOffSz, reginput^) <> nil) then
            Exit;
          Inc(reginput);
        end;
      OP_ANYOFCI:
        begin
          if (reginput = fInputEnd) or
            (StrScanCI(scan + REOpSz + RENextOffSz, reginput^) = nil) then
            Exit;
          Inc(reginput);
        end;
      OP_ANYBUTCI:
        begin
          if (reginput = fInputEnd) or
            (StrScanCI(scan + REOpSz + RENextOffSz, reginput^) <> nil) then
            Exit;
          Inc(reginput);
        end;

This is because I want to write PAIRS (kind, data) into the opcode. @andgineer

Maybe delete this getter?

function TRegExpr.GetInputString : RegExprString;
 begin
  if fInputString = '' then begin
    Error (reeGetInputStringWithoutInputString);
    EXIT;
   end;
  Result := fInputString;
 end; { of function TRegExpr.GetInputString }

Ok to remove getter, or it's needed?

Drop support for old Delphi?

{$IFDEF D3} {$DEFINE UseAsserts} {$ENDIF}
{$IFDEF FPC} {$DEFINE UseAsserts} {$ENDIF}
// Define 'use subroutine parameters default values' option (do not edit this definition).
{$IFDEF D4} {$DEFINE DefParam} {$ENDIF}
{$IFDEF FPC} {$DEFINE DefParam} {$ENDIF}
// Define 'OverMeth' options, to use method overloading (do not edit this definitions).
{$IFDEF D5} {$DEFINE OverMeth} {$ENDIF}
{$IFDEF FPC} {$DEFINE OverMeth} {$ENDIF}

This is not good. Maybe drop D5 and older? Or D4 and older? The code is not nice with all these ifdefs.

Dont support Delphi 4- in testcases

tests.pas

{$IFDEF VER130} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D5
{$IFDEF VER140} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D6
{$IFDEF VER150} {$DEFINE D7} {$DEFINE D6} {$DEFINE D5} {$DEFINE D4} {$DEFINE D3} {$DEFINE D2} {$ENDIF} // D7
{$IFDEF D5} {$DEFINE OverMeth} {$ENDIF}
{$IFDEF FPC} {$DEFINE OverMeth} {$ENDIF}

Let's drop support for the old Delphi versions here: these D2, D3, D4, D5. Only from the test.

Rename files in test/

I suggest renaming the files in the test/ dir:

  • name FPC project as test_fpc.*
  • name Delphi project as test_delphi.*
  • rename pas unit with tests to "tests.pas"
  • adjust also CI file "test.sh" and .gitignore

do you agree? @andgineer

How to refactor here? test fails

There is this part:

              case regparse^ of // r.e.extensions
                'd':
                  EmitRangeStr('0123456789');
                'w':
                  {$IFDEF UseWordChars}
                  EmitRangeStr(WordChars);
                  {$ELSE}
                  EmitNode(OP_ANYLETTER);
                  {$ENDIF}

Here I try to make a replacement, by analogy with \w and \s:

                  //EmitRangeStr('0123456789');
                  EmitNode(OP_ANYDIGIT);

But this immediately makes Test11 fail.

    (
    expression: '[^\d]+';
    inputText: '234578923457823659ARTZU38';
    substitutionText: '';
    expectedResult: 'ARTZU';
    matchStart: 19
    ),

Apparently this replacement is wrong for handling [^\d], which is already bad. I also suspect that the code for \w is flawed in the same way and will fail on [^\w]. This needs checking.
What should be done?

Idea how to do SmallSet optimization with new opcode

With the new opcode (OpKind_Char+len, chars), (OpKind_Meta, 'w'), (OpKind_Range, a, b) it's possible to implement the SmallSet optimization (a set of char with fewer than 32 elements).
Plan:
Collect the current opcodes (OpKind_...) until the end of the range (OpKind_End).
Then analyze the collected data, from the beginning of the opcode to the end of the range.
If the collected data fits into a 32-char range (e.g. 'a'...'z' plus symbols), erase the opcode (decrease regcode) and replace it with a SmallSet opcode.

This will need heavy testing: new tests with complex ranges such as [a/,k-xbe-fz], etc.
Ranges with metachars \w \d cannot be converted.
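The "fits into a 32-char range" check from the plan can be sketched as follows. This is only one possible reading: it assumes the class has already been expanded into a plain string of its chars, and it treats "32-char range" as a window of 32 consecutive char codes stored as a base char plus a 32-bit mask. TSmallSet, TryBuildSmallSet and InSmallSet are hypothetical names, not part of TRegExpr.

```pascal
program SmallSetSketch;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

type
  // Hypothetical compact form of a char class.
  TSmallSet = record
    Base: AnsiChar;  // lowest char in the class
    Bits: Cardinal;  // bit i set <=> Chr(Ord(Base) + i) is in the class
  end;

// Returns True and fills ASet when all chars of AChars fall inside
// one window of 32 consecutive codes; otherwise returns False.
function TryBuildSmallSet(const AChars: AnsiString; out ASet: TSmallSet): Boolean;
var
  i: Integer;
  Lo, Hi: AnsiChar;
begin
  Result := False;
  ASet.Base := #0;
  ASet.Bits := 0;
  if AChars = '' then
    Exit;
  Lo := AChars[1];
  Hi := AChars[1];
  for i := 2 to Length(AChars) do
  begin
    if AChars[i] < Lo then Lo := AChars[i];
    if AChars[i] > Hi then Hi := AChars[i];
  end;
  if Ord(Hi) - Ord(Lo) >= 32 then
    Exit; // too wide: keep the ordinary opcode
  ASet.Base := Lo;
  for i := 1 to Length(AChars) do
    ASet.Bits := ASet.Bits or (Cardinal(1) shl (Ord(AChars[i]) - Ord(Lo)));
  Result := True;
end;

// Membership test: one subtraction, one range check, one bit test.
function InSmallSet(const ASet: TSmallSet; Ch: AnsiChar): Boolean;
var
  d: Integer;
begin
  Result := False;
  d := Ord(Ch) - Ord(ASet.Base);
  if (d >= 0) and (d < 32) then
    Result := ((ASet.Bits shr d) and 1) <> 0;
end;

var
  S: TSmallSet;
begin
  if TryBuildSmallSet('abcxyz', S) then
    WriteLn('fits; x in set: ', InSmallSet(S, 'x'), ', m in set: ', InSmallSet(S, 'm'));
  if not TryBuildSmallSet('a/,z', S) then
    WriteLn('does not fit: falls back to the ordinary opcode');
end.
```

Under this reading a class such as [a/,k-xbe-fz] spans more than 32 codes and would not be converted, which is one reason the complex-range tests mentioned above matter.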

Single pass of regex compiler?

@andgineer I want to try making the compiler single-pass. Currently there are two passes: the first computes the size of the program, the second writes the program into the buffer. How can it be done with only one? Allocate memory right away and write into the buffer, calling ReallocMem when the buffer has to grow. Grow in steps of N chars (I suggest N=100). Many expressions are short and fit into 100-300 chars, which needs only 2 reallocs. OK?
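The growth strategy described above could look roughly like this. It is a sketch under assumptions, not the TRegExpr compiler: TCodeBuffer and EmitChar are hypothetical names, and the buffer holds plain AnsiChar values.

```pascal
program GrowBufferSketch;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

const
  GrowStep = 100; // N from the proposal

type
  // Hypothetical stand-in for the compiled-program buffer.
  TCodeBuffer = record
    Data: Pointer;     // start of the allocated block
    Used: Integer;     // chars written so far
    Capacity: Integer; // chars allocated
  end;

// Append one char, growing the block by GrowStep when it is full.
procedure EmitChar(var B: TCodeBuffer; Ch: AnsiChar);
begin
  if B.Used = B.Capacity then
  begin
    Inc(B.Capacity, GrowStep);
    ReallocMem(B.Data, B.Capacity); // may move the block
  end;
  PAnsiChar(B.Data)[B.Used] := Ch;
  Inc(B.Used);
end;

var
  Buf: TCodeBuffer;
  i: Integer;
begin
  Buf.Data := nil;
  Buf.Used := 0;
  Buf.Capacity := 0;
  for i := 1 to 250 do
    EmitChar(Buf, 'x');
  WriteLn(Buf.Used, ' ', Buf.Capacity); // prints "250 300"
  FreeMem(Buf.Data);
end.
```

Because ReallocMem may move the block, a single-pass compiler built this way would have to keep offsets into the buffer rather than raw pointers across any call that can grow it.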

[Fatal Error] RegExpr.pas(735): File not found: 'System.Character.dcu' on D7

There seems to be an issue with ifdef statements for D7:

[Fataler Fehler] RegExpr.pas(735): Datei nicht gefunden: 'System.Character.dcu'
[Fatal Error] RegExpr.pas(735): File not found: 'System.Character.dcu'

I had to comment out the following lines:
// uses
// System.Character; // System.Character exists since Delphi 2009

D7 Build 4.453, TRegExpr latest

Strange behavior: \w* and backreferences

uses RegExpr;
begin
  WriteLn( ReplaceRegExpr('(\w*)','name.ext','$1.new', True) );
  ReadLn;
end.

Returns: name.new.new.ext.new.new. Is this a bug or incorrect use?

In Russian: some details of the problem, from here on down.
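One plausible reading of the result, offered as an assumption rather than a maintainer's answer: \w* can also match the empty string, so after "name" and after "ext" an extra empty match is found and replaced too, which would account for the doubled ".new". A variant with \w+, which cannot match an empty string, avoids those extra matches:

```pascal
program ReplaceWordPlus;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

uses
  RegExpr;

begin
  // \w+ needs at least one word char, so only "name" and "ext" match;
  // the expected result is name.new.ext.new
  WriteLn(ReplaceRegExpr('(\w+)', 'name.ext', '$1.new', True));
end.
```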

Format src

I can format source code a little, by mass replaces

  • delete spaces before ( and [
  • delete spaces before : and :=
  • lowercase some keywords: Const Var
  • title case EXIT

Ok? @andgineer

[\s\S] doesn't seem to work

Using the regex "Test:\s*([\s\S]?)\s;" (without quotes, obviously) with an input of "Test: hello ;" correctly returns "hello" in other regex tools (e.g. http://www.regexr.com/) but returns no results with TRegExpr.

[image]

Using "Test:\s*(.?)\s;" works for this case in TRegExpr, but obviously wouldn't do the same job with a multi-line input string.

[image] vs [image]

Unless I'm mistaken, the below should return "hell\nlo":

[image]

Remove WordChars, SpaceChars props, add hardcoded Unicode checks for them

For my CudaText I would like to make these changes.
I suggest removing the WordChars and SpaceChars properties and replacing them with hardcoded checks.

Both are simple and reliable to do. As for why WordChars was introduced at all, my understanding is that people did not want to do a full UnicodeData analysis but did want to detect many letters, so WordChars was bolted on.
But this is clumsy: it is slow, and it is unrealistic to list all letters there, since Unicode has a great many. Many users will add only the Latin umlauts, and what about other languages? Some will omit Greek, some Russian. And then there are the Asian scripts, which almost everyone will omit (Japanese, Chinese, Korean, Indian, etc.).
Hardcoded checks will be fast, faster than checking against 20-40 letters; they will be lookups in UnicodeData.

For SpaceChars it is also fast, also a UnicodeData lookup.
The UnicodeData checks will be inside ifdef unicode.

Do you approve the patch?
@andgineer

String contains null character not matched properly

The NUL character is not matched properly.

program testemail;

uses
    regexpr;

var
  RegexObj: TRegExpr;

begin
  RegexObj := TRegExpr.Create;

  regexObj.expression := '^(\d+):CONTENT_LENGTH\x00(\d+)\x00';
  if RegexObj.Exec('1065:CONTENT_LENGTH' + #0 + '185364' + #0 + 'SCGI'+ #0 + '1' + #0 + 'CONTENT_') then 
      WriteLn('matched!');
  RegexObj.Free;
end.

Allow NULL chars in string

I can make a patch to allow NULL chars in the string. For this I want to add a method InBuffer, which checks whether an offset is inside the buffer. Okay?
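A small sketch of why an explicit bounds check helps here: a zero-terminated scan stops at the first #0, while the string's real length covers the whole input. The InBuffer function below is a hypothetical free-standing version of the proposed method, with assumed parameters (the pointer to test plus the buffer start and end).

```pascal
program NulInString;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

uses
  SysUtils;

// Hypothetical helper: True when P points inside [BufStart, BufEnd).
function InBuffer(P, BufStart, BufEnd: PAnsiChar): Boolean;
begin
  Result := (P >= BufStart) and (P < BufEnd);
end;

var
  S: AnsiString;
  P, PEnd: PAnsiChar;
begin
  S := 'ab' + #0 + 'cd';
  P := PAnsiChar(S);
  PEnd := P + Length(S);
  WriteLn(Length(S));                 // 5: the real length, NUL included
  WriteLn(StrLen(P));                 // 2: a zero-terminated scan stops early
  WriteLn(InBuffer(P + 3, P, PEnd));  // TRUE: still inside the input
  WriteLn(InBuffer(PEnd, P, PEnd));   // FALSE: one past the last char
end.
```

With such a check, "end of input" is decided by position rather than by meeting a #0 char, so a NUL in the middle of the data is treated like any other char.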

Why bitpacked modifiers?

function TRegExpr.GetModifier(AIndex: integer): boolean;
var
  Mask: integer;
begin
  Result := False;
  case AIndex of
    1:
      Mask := MaskModI;
    2:
      Mask := MaskModR;
    3:
      Mask := MaskModS;
    4:
      Mask := MaskModG;
    5:
      Mask := MaskModM;
    6:
      Mask := MaskModX;
  else
    begin
      Error(reeModifierUnsupported);
      Exit;
    end;
  end;
  Result := (fModifiers and Mask) <> 0;
end; { of function TRegExpr.GetModifier
  -------------------------------------------------------------- }

procedure TRegExpr.SetModifier(AIndex: integer; ASet: boolean);
var
  Mask: integer;
begin
  case AIndex of
    1:
      Mask := MaskModI;
    2:
      Mask := MaskModR;
    3:
      Mask := MaskModS;
    4:
      Mask := MaskModG;
    5:
      Mask := MaskModM;
    6:
      Mask := MaskModX;
  else
    begin
      Error(reeModifierUnsupported);
      Exit;
    end;
  end;
  if ASet then
    fModifiers := fModifiers or Mask
  else
    fModifiers := fModifiers and not Mask;
end; { of procedure TRegExpr.SetModifier
  -------------------------------------------------------------- }

@andgineer I suggest making N boolean properties instead of N bits in an integer.
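The suggestion could be sketched like this. TModifiersDemo is a hypothetical demo class; only the storage strategy is the point, and the property names are reused from the modifier letters purely for illustration.

```pascal
program ModifierProps;
{$IFDEF FPC}{$MODE OBJFPC}{$H+}{$ENDIF}

type
  // Hypothetical demo class: one Boolean field per modifier.
  TModifiersDemo = class
  private
    fModI, fModR, fModS, fModG, fModM, fModX: Boolean;
  public
    // Each property maps straight onto its own field, so there is no
    // mask lookup and no "unsupported index" error path.
    property ModifierI: Boolean read fModI write fModI;
    property ModifierR: Boolean read fModR write fModR;
    property ModifierS: Boolean read fModS write fModS;
    property ModifierG: Boolean read fModG write fModG;
    property ModifierM: Boolean read fModM write fModM;
    property ModifierX: Boolean read fModX write fModX;
  end;

var
  M: TModifiersDemo;
begin
  M := TModifiersDemo.Create;
  try
    M.ModifierI := True;
    WriteLn(M.ModifierI, ' ', M.ModifierS); // fields start as False
  finally
    M.Free;
  end;
end.
```

One possible reason for the packed form is that all modifiers can be saved and restored with a single integer assignment; with separate fields a small record would serve the same purpose.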

\w \W - opcode or string?

ParseAtom has this code for \d and \D:

          'd': begin // r.e.extension - any digit ('0' .. '9')
             ret := EmitNode (ANYDIGIT);
             flagp := flagp or HASWIDTH or SIMPLE;
            end;
          'D': begin // r.e.extension - not digit ('0' .. '9')
             ret := EmitNode (NOTDIGIT);
             flagp := flagp or HASWIDTH or SIMPLE;
            end;

Here the ANYDIGIT opcode, or its inverse, is emitted. OK.
For \w and \W it is done differently: either an opcode is emitted or the WordChars string is emitted.

          'w': begin // r.e.extension - any english char / digit / '_'
             {$IFDEF UseSetOfChar}
             ret := EmitRange (ANYOF);
             EmitRangeStr (WordChars);
             EmitRangeC (#0);
             {$ELSE}
             ret := EmitNode (ANYLETTER);
             {$ENDIF}
             flagp := flagp or HASWIDTH or SIMPLE;
            end;

Why not always emit an opcode here? That seems better: then UseUnicodeWordDetection would apply in this case too (for now it apparently works in some places and not in others). @andgineer

Here is where UseUnicodeWordDetection is used:

function TRegExpr.IsWordChar(AChar: REChar): Boolean;
begin
  Result := Pos(AChar, fWordChars)>0;
  {$IFDEF UnicodeWordDetection}
  If Not Result and UseUnicodeWordDetection then
    Result:=IsUnicodeWordChar(aChar);
  {$ENDIF}
end;

Small optimization: replace regdummy to bool flag

I suggest a small optimization: remove regdummy (the check if p = @regdummy is used to detect the first pass, which computes the program size) and replace it with a flag DummyPass: boolean. Slightly better. @andgineer OK?
