mrabarnett / mrab-regex

License: Other

Python 14.29% C 85.71%

mrab-regex's Issues

regex.search("((?i)blah)\\s+\\1", "blah BLAH") doesn't return None

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> print(regex.search("((?i)blah)\\s+\\1", "blah BLAH").group(0,1))
('blah BLAH', 'blah')
>>> print(regex.search("((?i)blah)\\s+\\1", "blah BLAH"))
<_regex.Match object at 0x00C0BBB8>
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("((?i)blah)\\s+\\1", "blah BLAH"))
None
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120114

Please provide any additional information below.

not all keywords are found by named list with overlapping keywords when full Unicode casefolding is required

Original report by Anonymous.

What steps will reproduce the problem?

>>> import regex
>>> p = regex.compile(ur'(?fi)\L<keywords>', keywords=['post','pos'])
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')

What is the expected output? What do you see instead?

Expected:

[u'POST', u'Post', u'post', u'po\u017ft', u'po\ufb06', u'po\ufb05']

Got:

[u'POST', u'Post', u'post', u'po\u017ft']

What version of the product are you using? On what operating system?

regex.version == '2.4.0'
sys.version_info == (2, 6, 5, 'final', 0)
platform.platform() == 'Linux-3.0.0-15-generic-x86_64-with-Ubuntu-11.10-oneiric'

Please provide any additional information below.

>>> p = regex.compile(ur'(?fi)pos|post')
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')
[u'POS', u'Pos', u'pos', u'po\u017f']
>>> p = regex.compile(ur'(?fi)post|pos')
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')
[u'POST', u'Post', u'post', u'po\u017ft']
>>> p = regex.compile(ur'(?fi)post|another')
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')
[u'POST', u'Post', u'post', u'po\u017ft', u'po\ufb06', u'po\ufb05']
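
To see why all six spellings are expected to match, here is a minimal sketch using Python 3's built-in `str.casefold()`. It does not use the regex module and does not reproduce the Python 2 environment of the report; it only illustrates that each word folds to `post` under full Unicode case folding.

```python
# Full Unicode case folding with the standard library (Python 3).
words = ["POST", "Post", "post", "po\u017ft", "po\ufb06", "po\ufb05"]

# U+017F (long s) folds to "s"; U+FB05 and U+FB06 (st ligatures) fold to "st".
folded = [word.casefold() for word in words]

print(folded)
```

Because the ligatures fold to two characters, a match on `pos` alone would have to end in the middle of a single code point, which may be why the shorter keyword interferes with the longer one.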

regex.compile("a#comment\n*", flags=regex.X) causes "_regex_core.error: nothing to repeat"

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> regex.compile("a#comment\n*", flags=regex.X)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 339, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 358, in parse_sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 368, in parse_item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 709, in parse_element
    raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> regex.compile("a#comment\n*", flags=regex.X)
regex.Regex('a#comment\n*', flags=regex.X | regex.V0)
>>> regex.search("a#comment\n*", "aaa", flags=regex.X).group(0)
'aaa'
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120122

Please provide any additional information below.

non-capturing group (around surrogate character sets) cause recursion error

Original report by Anonymous.

I just encountered a corner case when using the non-capturing group (?: ...) that leads to a recursion error. The group I tested contains character sets of surrogate characters; I am not sure whether this is relevant to the problem.
The same pattern works as expected with normal parentheses and without any grouping, cf.:

>>> regex.findall(ur"(?s)(?:[\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 253, in findall
  File "regex.pyc", line 394, in _compile
  File "_regex_core.pyc", line 2187, in fix_groups
  File "_regex_core.pyc", line 2187, in fix_groups
  File "_regex_core.pyc", line 2187, in fix_groups
...
  File "_regex_core.pyc", line 2187, in fix_groups
  File "_regex_core.pyc", line 2187, in fix_groups
  File "_regex_core.pyc", line 2187, in fix_groups
  File "_regex_core.pyc", line 2187, in fix_groups
RuntimeError: maximum recursion depth exceeded
>>> regex.findall(ur"(?s)[\ud800-\udbff][\udc00-\udfff]", u"a𐀀bcdefg")
[u'\U00010000']
>>> regex.findall(ur"(?s)([\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
[u'\U00010000']
>>>

re can handle all these patterns normally:

>>> re.findall(ur"(?s)([\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
[u'\U00010000']
>>> re.findall(ur"(?s)[\ud800-\udbff][\udc00-\udfff]", u"a𐀀bcdefg")
[u'\U00010000']
>>> re.findall(ur"(?s)([\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
[u'\U00010000']
>>>

Using regex-0.1.20110616, win 7, python 2.7.2 (the official, narrow unicode build, of course)

vbr

support concatenation of compiled patterns -- feature

Original report by Anonymous.

With the new named lists feature, a compiled pattern may contain both the pattern string, and references to one or more lists. This means that the standard way of composing patterns, concatenating the .pattern attribute of compiled patterns with other text, won't "just work".

Let me suggest an API modification to get around this. Instead of the "pattern" parameter to regex.compile() being a string, allow it to be either a string, or a sequence of objects, each of which must be either a string or a compiled pattern. The elements of the sequence are concatenated to generate the new pattern string, and list references in the compiled pattern elements of the sequence are transferred into the new compiled pattern. Duplicate names for list references could be an error; or, they could be handled automatically by name-mangling the list names within the aggregate compiled pattern.
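
The proposal could look roughly like the sketch below. It uses the standard `re` module and a hypothetical helper named `compose` (not part of regex or re); it only concatenates pattern sources, ignores the flags of the compiled parts, and does not handle named lists, which is the part the request asks the library itself to solve.

```python
import re

def compose(parts, flags=0):
    """Hypothetical helper: build one pattern from strings and compiled patterns."""
    sources = []
    for part in parts:
        if isinstance(part, str):
            sources.append(part)
        else:
            # Wrap a compiled pattern's source so its alternations stay contained.
            sources.append("(?:" + part.pattern + ")")
    return re.compile("".join(sources), flags)

word = re.compile(r"cat|dog")
combined = compose([r"\b", word, r"s?\b"])
print(combined.pattern)
print(combined.findall("cats and dog"))
```

Wrapping each compiled part in a non-capturing group is a design choice of this sketch; without it, an alternation such as `cat|dog` would absorb the neighbouring text.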

Support for regex in Property Values

Original report by Anonymous.

I just found a chapter in the Unicode guidelines for regular expressions concerning property values and would like to ask whether supporting regex patterns (or some subset thereof) here would be possible.
http://unicode.org/reports/tr18/#Wildcard_Properties

I am not aware of any implementation that already supports it, nor do I know how much extra complexity the parser would need, but it looks like a feature orthogonal to the Unicode properties and the set operations which regex already has.

I see that the use cases would cover rather special approaches; in my case it would allow investigating the Unicode character repertoire itself. (Currently I can do something like that after grabbing all the character names via unicodedata.)

Otherwise, on "normal" text, it could cover cases where multiple character ranges should be considered (i.e. basic xxx, xxx supplement, xxx extended ...). (I am not sure how exactly the current Script property relates to this.)
E.g. some errors or even spoofing attempts might be checked for on graphically similar characters from different ranges, cf.:

(dec: 111; hex: 0x6f) LATIN SMALL LETTER O
(dec: 959; hex: 0x3bf) GREEK SMALL LETTER OMICRON
(dec: 1086; hex: 0x43e) CYRILLIC SMALL LETTER O
(dec: 1413; hex: 0x585) ARMENIAN SMALL LETTER OH
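
The workaround mentioned above (matching a regular expression against character names taken from unicodedata) can be sketched as follows. The helper name `chars_with_name` and the name pattern are illustrative assumptions, and the exact set of hits depends on the Unicode version bundled with the Python build.

```python
import re
import sys
import unicodedata

def chars_with_name(name_pattern):
    """Return (code point, name) pairs whose Unicode name matches name_pattern."""
    compiled = re.compile(name_pattern)
    hits = []
    for codepoint in range(sys.maxunicode + 1):
        # Unnamed code points (controls, surrogates, unassigned) yield "".
        name = unicodedata.name(chr(codepoint), "")
        if name and compiled.search(name):
            hits.append((codepoint, name))
    return hits

pattern = r"^(LATIN|GREEK|CYRILLIC|ARMENIAN) SMALL LETTER O(H|MICRON)?$"
for codepoint, name in chars_with_name(pattern):
    print("(dec: %d; hex: %#x) %s" % (codepoint, codepoint, name))
```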

I'd like to stress that this is only meant as a proposal for consideration; it surely wouldn't be worth extensive effort or the risk of becoming a source of bugs.

Regards
Vlastimil Brom

regex.compile("qu", flags=regex.I|regex.V1) doesn't match "qu"

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

$ python3
Python 3.2.2 (default, Dec 23 2011, 15:22:48) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> r = regex.compile("qu", flags=regex.I|regex.V1)
>>> r
regex.Regex('qu', flags=regex.F | regex.I | regex.V1)
>>> print(r.match("qu"))
None
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> r = regex.compile("qu", flags=regex.I|regex.V1)
>>> print(r.match("qu").group(0))
qu

What version of the product are you using? On what operating system?

Mac OS X 10.6.8
Python 3.2.2
regex 0.1.20120103

Please provide any additional information below.

I can't reproduce this issue with the following steps:

>>> import regex
>>> r = regex.compile("qu")
>>> print(r.match("qu").group(0))
qu
>>> r = regex.compile("qu", flags=regex.I)
>>> print(r.match("qu").group(0))
qu
>>> r = regex.compile("qu", flags=regex.V1)
>>> print(r.match("qu").group(0))
qu

regex.search("([\da-f:]+)$", "E", regex.I|regex.V1) returns None

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> print(regex.search("([\da-f:]+)$", "E", regex.I|regex.V1))
None
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("([\da-f:]+)$", "E", regex.I|regex.V1))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("([\da-f:]+)$", "E", regex.I|regex.V1).group(0))
E
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2
regex 0.1.20120112

Please provide any additional information below.

# "e" is ok.
>>> import regex
>>> print(regex.search("([\da-f:]+)$", "e", regex.I|regex.V1))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("([\da-f:]+)$", "e", regex.I|regex.V1).group(0))
e
>>>

adding the set operations and possibly the supported properties to the help text

Original report by Anonymous.

Hi,

first of all, many thanks for further excellent additions to regex, such as the extended unicode properties and newly the set operations!

I'd like to ask for some help text additions in this respect.

Could the Features.rst / Features.html be made accessible programmatically from within the library? Especially the parts "Unicode codepoint properties, including scripts and blocks" and "Set operators" may be good additions, as these features are probably less well known. Maybe the set operations syntax could be added briefly to the initial part of the help, "The special characters are:", under "[]".

Furthermore, there could ideally be a reference for the multitude of the supported unicode properties.

However, I can see that the help text might get too large. Maybe links to the respective data (the Unicode standard etc.) might be more appropriate. (Some of these might eventually be added to the interface of unicodedata, but that's not relevant here.)

(now using regex-0.1.20110514.tar.gz, python 2.7, on win 7)

Thanks again
vbr

regex.compile("^((?>\w+)|(?>\s+))*$") causes "TypeError: 'GreedyRepeat' object is not iterable"

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

python3
Python 3.2.2 (default, Dec 23 2011, 15:22:48) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.compile("^((?>\w+)|(?>\s+))*$", flags=regex.V1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
    parsed = parsed.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2805, in optimise
    s = s.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2450, in optimise
    subpattern = self.subpattern.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2514, in optimise
    subpattern = self.subpattern.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1813, in optimise
    prefix, branches = Branch._split_common_prefix(info, branches)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1903, in _split_common_prefix
    while pos < end_pos and prefix[pos].can_be_affix() and all(a[pos] ==
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1731, in can_be_affix
    return all(s.can_be_affix() for s in self.subpattern)
TypeError: 'GreedyRepeat' object is not iterable
>>> try:
...     regex.compile("^((?>\w+)|(?>\s+))*$")
... except regex.error:
...     print("Wrong regexp!")
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
    parsed = parsed.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2805, in optimise
    s = s.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2450, in optimise
    subpattern = self.subpattern.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2514, in optimise
    subpattern = self.subpattern.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1813, in optimise
    prefix, branches = Branch._split_common_prefix(info, branches)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1903, in _split_common_prefix
    while pos < end_pos and prefix[pos].can_be_affix() and all(a[pos] ==
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1731, in can_be_affix
    return all(s.can_be_affix() for s in self.subpattern)
TypeError: 'GreedyRepeat' object is not iterable
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> regex.compile("^((?>\w+)|(?>\s+))*$", flags=regex.V1)
regex.Regex('^((?>\\w+)|(?>\\s+))*$', flags=regex.F | regex.V1)
>>> try:
...     regex.compile("^((?>\w+)|(?>\s+))*$")
... except regex.error:
...     print("Wrong regexp!")
... 
Wrong regexp!
>>>

What version of the product are you using? On what operating system?

Mac OS X 10.6.8
Python 3.2.2
regex 0.1.20111223

Please provide any additional information below.

# The case of Python 3.2.2 standard re module
>>> import re
>>> try:
...     re.compile("^((?>\w+)|(?>\s+))*$")
... except re.error:
...     print("Wrong regexp!")
... 
Wrong regexp!
>>>

Setup.py is missing a reference to _regex_unicode.c

Original report by Anonymous.

When I try to import regex I get the following import error:

dlopen(<snipped>/python2.5/site-packages/_regex.so, 2): Symbol not found: _re_is_same_char_ign
  Referenced from: <snipped>/python2.5/site-packages/_regex.so
  Expected in: dynamic lookup

Changing setup.py line 57 like below fixes this:

ext_modules=[Extension('_regex', [os.path.join(PKG_BASE, '_regex.c'), os.path.join(PKG_BASE, '_regex_unicode.c')])]

can't build the project

Original report by Anonymous.

I thought I'd try building the module, and expand the test cases a bit, but I can't seem to find the setup.py file.

I did the

hg clone https://mrab-regex-hg.googlecode.com/hg/ mrab-regex-hg

Is there some trick to building it?

fuzzy patterns in negative lookarounds - case sensitivity difference

Original report by Anonymous.

Hi,

recently I used the fuzzy matching capability to detect some misspellings; I used lookarounds to filter out the correct forms.
In some cases I saw differences I can't understand; there appears to be some correlation with the i and V0/V1 flags. (Using regex 0.1.20111014; py 2.7.2; win XP.)

# caseless negative lookbehind in V1 somehow doesn't filter out e==1 matches
>>> regex.findall(r"(?iV1)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word2', 'word3', 'word234', 'word23']

# while in case-sensitive mode this works as expected
>>> regex.findall(r"(?V1)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>> 
>>> 
# the (hopefully) equivalent lookaheads both work the same
>>> regex.findall(r"(?iV1)(?!\m(?:word){e<=1}\M)\m(?:word){e<=3}\M", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>> regex.findall(r"(?V1)(?!\m(?:word){e<=1}\M)\m(?:word){e<=3}\M", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>> 

# the original above lookbehinds both work with V0 flag 
>>> regex.findall(r"(?V0)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>> regex.findall(r"(?iV0)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>>

Unless I have made some stupid mistake, the patterns should match the variants of "word" with at least 1 and at most 3 errors. Are there some other aspects to consider if using lookarounds with the fuzzy patterns?

BTW, a probably silly idea originated from this kind of searches - how about adding support for only erroneous matches via ... {1<=e<=3} ?

(I see, it could get quite complex with some more sophisticated error costs arithmetics.)

However, I am glad, the same could be done with lookarounds.
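
As a rough cross-check of the expected results, the sketch below computes a plain Levenshtein distance for each whitespace-separated token. It is an assumption-laden simplification: it compares whole tokens rather than emulating regex's fuzzy search, and it treats insertions, deletions and substitutions as cost 1 each. Since the lookbehind with `{e<=1}` removes tokens with zero or one error, the tokens left over are those with two or three errors.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute or keep
            ))
        previous = current
    return previous[-1]

text = "word word2 word word3 word word234 word23 word"
# Keep tokens with more than 1 and at most 3 errors relative to "word".
hits = [token for token in text.split() if 1 < edit_distance("word", token) <= 3]
print(hits)
```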

(As a side note, is this kind of possible bug report appropriate for a separate issue, or would it have been better within the "approximate matching" issue?)

regards,
vbr

regex 0.1.20110514 findall overlapped not working with 'start of string' expression

Original report by Anonymous.

Apologies if this is again not the right place to post this.

I'm trying to use regex 0.1.20110514 with the overlapped=True feature.

It works great, unless I have the 'start of string' (caret) character in my regular expression:

>>> regex.findall(r"a.*b","abadalaba",overlapped=True)
['abadalab', 'adalab', 'alab', 'ab']
>>> regex.findall(r"^a.*b","abadalaba",overlapped=True)
['abadalab']

If I understand correctly, the second regexp should produce the same results as the first one, since all the results are at the beginning of the string.
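
One possible explanation can be sketched with the standard `re` module. The helper `findall_overlapped` below is hypothetical and rests on an assumption about how an overlapped search proceeds: each new attempt starts one character after the start of the previous match. Under that assumption `^` can succeed only on the first attempt, because later attempts no longer begin at the start of the string.

```python
import re

def findall_overlapped(pattern, text):
    """Emulate an overlapped findall by restarting one character after each match start."""
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        match = compiled.search(text, pos)
        if match is None:
            break
        results.append(match.group(0))
        # Restart just past the start of this match, not past its end.
        pos = match.start() + 1
    return results

print(findall_overlapped(r"a.*b", "abadalaba"))
print(findall_overlapped(r"^a.*b", "abadalaba"))
```

With this emulation the first call yields four matches and the second only one, the same shape as the output in the report.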

Recursive patterns

Original report by Anonymous.

In order to parse certain strings belonging to certain kinds of languages, recursive patterns for recursive matching are necessary. For example, if you want to parse and capture a parenthesized chunk of text which may contain nested parentheses, you'll need recursion.

Perl already supports recursive matching, so does PCRE and with that many libraries for languages such as PHP. I always wondered why the Python regex engines never did.

Please refer to http://perldoc.perl.org/perlre.html for documentation on the widely accepted syntax for recursive patterns.
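
To illustrate what a recursive pattern has to accomplish, here is a minimal hand-written sketch that finds a balanced parenthesised chunk by tracking nesting depth. The function name `balanced_span` is hypothetical, and this is plain Python rather than any regex syntax.

```python
def balanced_span(text, start):
    """Return the end index of the parenthesised chunk opening at text[start], or -1."""
    if start >= len(text) or text[start] != "(":
        return -1
    depth = 0
    for index in range(start, len(text)):
        if text[index] == "(":
            depth += 1
        elif text[index] == ")":
            depth -= 1
            if depth == 0:
                return index + 1
    return -1  # ran out of text before the chunk closed

sample = "f(a, (b + c), g(d)) + h(e)"
end = balanced_span(sample, 1)
print(sample[1:end])
```

The depth counter is exactly the state that a non-recursive regular expression cannot keep for arbitrary nesting, which is why the request asks for recursion support.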

hg EOL extension needs to be used and configured

Original report by Anonymous.

All developers need to use the http://mercurial.selenic.com/wiki/EolExtension to deal with cross platform line ending conversion issues.

[extensions]
eol =

In their own .hgrc file.

This repo also needs the attached file checked in as a .hgeol file.

Patch to restore speed lost in commit 7abd9f9bb1

Original report by Anonymous.

The attached diff against c0186afe8c50 restores the speed lost in 7abd9f9bb1 for me (25% faster on my regression tests). In addition, all my tests still pass without error.

Looking at the diff, I confirmed empirically that it's the conditional:

if (pattern->repeat_info[i].inner)

that seems to slow everything down. I don't know why this test and dereference would be so expensive; it might very well be a gcc bug (I haven't compared the generated assembly yet).

regex.search("a(bc)d", "abcd", regex.I|regex.V1) returns None

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

C:\Python32\3.2.2\Scripts>python.exe
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> print(regex.search("a(bc)d", "abcd", regex.I|regex.V1))
None

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("a(bc)d", "abcd", regex.I|regex.V1))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("a(bc)d", "abcd", regex.I|regex.V1).group(0))
abcd
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2
regex 0.1.20120112

Please provide any additional information below.

regex.search("^(?=ab(de))(abd)(e)", "abde").groups() returns (None, 'abd', 'e') instead of ('de', 'abd', 'e')

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> print(regex.search("^(?=ab(de))(abd)(e)", "abde").groups())
(None, 'abd', 'e')
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("^(?=ab(de))(abd)(e)", "abde").groups())
('de', 'abd', 'e')
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120112

Please provide any additional information below.

regex ver.0.1.20120103 is OK.
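
For comparison, the same pattern can be checked against the standard `re` module. This is only a cross-check of the expected behaviour, not a test of the regex module itself.

```python
import re

# Groups captured inside a lookahead remain set after the lookahead succeeds.
match = re.search(r"^(?=ab(de))(abd)(e)", "abde")
print(match.groups())
```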

regex.search("^(a){0,0}", "abc").group(0,1) returns ('a', 'a') instead of ('', None)

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> print(regex.search("^(a){0,0}", "abc").group(0,1))
('a', 'a')
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("^(a){0,0}", "abc").group(0,1))
('', None)
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120112

Please provide any additional information below.

Broken pattern

Original report by Anonymous.

re.search('Derde\s*:', 'aaaaaa:\nDerde:')

matches on Python 2.6.5 (Linux x86_64) using standard re module, doesn't match using regex from trunk rev b05186807a

This one is very weird, for example:

re.search('Derde\s*:', 'aaaaa:\nDerde:')

matches just fine using both modules....

property database - aliases (question)

Original report by Anonymous.

Well, it doesn't really count as normal usage of the regex module, but I tried (again) to toy with its data structures, e.g. to get some Unicode properties not currently available in unicodedata.

(It is not very practical, as I can just grab the Unicode data files for searching, but I also wanted to dig into the (Python) source of regex a bit...)

So far, I could make a check for a Unicode property (hopefully :-) work;
is it just, e.g.:

>>> _regex.has_property_value((_regex.get_properties()["SCRIPT"][0] << 16) | _regex.get_properties()["SCRIPT"][1]["GREEK"], ord(u"Σ"))
1
>>> 
?

Is it the case that there is no other access path, i.e. to get the properties of a given character, one has to check each property for every possible value and collect the successful matches? (Actually, it works surprisingly fast, given how clumsy this approach is.)

(It is really a kind of exercise, I wouldn't want to ask for a more comfortable access to this data, you already offered, as this belongs to unicodedata.)

On a related note, is it somehow possible to programmatically access the original, non-normalised property names and values, as listed on:

https://code.google.com/p/mrab-regex-hg/wiki/UnicodeProperties

It is possible to collect the aliases belonging to each other and take the longest ones as full forms, but the casing and spaces probably can't be recovered, can they?

Sorry for this possibly irrelevant "issue" (having an issue type "question" would likely be silly...).

And, of course, many thanks for the recent enhancements and fixes!

regards,
vbr

Change NEW flag

Original report by Anonymous.

Feature request: please change the NEW flag to something else. In five or six years (give or take), the re module will be long forgotten, compatibility with it will not be needed, so-called "new" features will no longer be new, and the NEW flag will just be silly.

If you care about future compatibility, some sort of version specification would be better, e.g. "VERSION=0" (current re module), "VERSION=1" (this regex module), "VERSION=2" (next generation). You could then default to VERSION=0 for the first few releases, and potentially change to VERSION=1 some time in the future.

Otherwise, I suggest swapping the sense of the flag: instead of "re behaviour unless NEW flag is given", I'd say "re behaviour only if OLD flag is given". (Old semantics will, of course, remain old even when the new semantics are no longer new.)

(Copied from http://bugs.python.org/issue2636)

regex.compile("a(?x: b c )d") causes "_regex_core.error: missing )"

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> regex.compile("a(?x: b c )d")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 339, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 358, in parse_sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 368, in parse_item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 660, in parse_element
    element = parse_paren(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 796, in parse_paren
    return parse_flags_subpattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 1014, in parse_flags_subpattern
    source.expect(")")
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 3439, in expect
    raise error("missing {}".format(substring))
_regex_core.error: missing )
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> regex.compile("a(?x: b c )d")
regex.Regex('a(?x: b c )d', flags=regex.V0)
>>> regex.search("a(?x: b c )d", "abcd").group(0)
'abcd'
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120122

Please provide any additional information below.

regex.compile("\\ ", regex.X) causes "_regex_core.error: bad escape"

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> regex.compile("\\ ", regex.X)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 335, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 351, in parse_sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 364, in parse_item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 585, in parse_element
    return parse_escape(source, info, False)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 1012, in parse_escape
    raise error("bad escape")
_regex_core.error: bad escape
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.compile("\\ ", regex.X|regex.D))
CHARACTER MATCH 32
regex.Regex('\\ ', flags=regex.D | regex.X | regex.V0)
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120112

Please provide any additional information below.

syntax for beginning and end of the word?

Original report by Anonymous.

Hi,

I just noticed the addition for beginning/end of the word and I appreciate it very much.

I just wanted to ask what the "m" in \m and \M stands for.

I'd rather have expected
\< and \> (maybe because of OpenOffice), but it turns out that \m and \M are used in Tcl and that the other syntax isn't much more widely used either:

http://www.regular-expressions.info/refflavors.html

[I must have somehow missed that page so far; it's nice to see that regex now has most of the "useful" features (marked as omissions in re) and some extras ...]

Would it be possible to alias this alternative syntax too, or are there some drawbacks with it, or is it considered "line noise"?

Just an aside: am I supposed to be able to set the issue type (such as Feature request etc.), or is it always "Defect", which could be changed by the administrators later?

Thanks and regards,
vbr

regex.compile("(?>b)") causes "TypeError: 'Character' object is not subscriptable"

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

$ python3
Python 3.2.2 (default, Dec 23 2011, 15:22:48) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.compile("(?>b)", flags=regex.V1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
    parsed = parsed.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1712, in optimise
    prefix, subpattern = Atomic._split_atomic_prefix(subpattern)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1765, in _split_atomic_prefix
    prefix, subpattern = prefix[ : count], subpattern[count : ]
TypeError: 'Character' object is not subscriptable
>>> try:
...     regex.compile("(?>b)")
... except regex.error:
...     print("Wrong Regexp!")
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
    parsed = parsed.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1712, in optimise
    prefix, subpattern = Atomic._split_atomic_prefix(subpattern)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1765, in _split_atomic_prefix
    prefix, subpattern = prefix[ : count], subpattern[count : ]
TypeError: 'Character' object is not subscriptable
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> regex.compile("(?>b)", flags=regex.V1)
regex.Regex('(?>b)', flags=regex.F | regex.V1)
>>> try:
...     regex.compile("(?>b)")
... except regex.error:
...     print("Wrong RegExp!")
... 
Wrong RegExp!
>>>

What version of the product are you using? On what operating system?

Mac OS X 10.6.8
Python 3.2.2
regex 0.1.20111223

Please provide any additional information below.

# The case of Python 3.2.2 standard re module
>>> import re
>>> try:
...     re.compile("(?>b)")
... except re.error:
...     print("Wrong RegExp!")
... 
Wrong RegExp!
>>>

approximate matching -- feature request

Original report by Anonymous.

I'm currently using the TRE regex engine to match output from OCR, because it supports approximate matching. Very useful. Would be nice to have that capability in Python regex, as well.

different handling of \w in unicode patterns in regex and re

Original report by Anonymous.

Hi,

I think it may be intended behaviour, but I didn't find it mentioned anywhere in the docs. Sorry if it is already discussed somewhere I haven't looked ...

It seems that for unicode patterns like ur"..." regex implicitly sets the unicode flag (?u), while re doesn't seem to do that.

>>> re.findall(ur"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(ur"\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> re.findall(r"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(r"\w", u"aáb")
[u'a', u'b']
>>> re.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> regex.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>>

Python 2.7.1, Win XP SP3, 32 bit Czech; regex r902c02d44f
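
For comparison only, here is a minimal sketch of the same check with Python 3's standard re module. It is not the Python 2.7 setup reported above. It assumes only that Python 3 str patterns are Unicode-aware by default and that re.ASCII restricts \w to ASCII word characters:

```python
import re

text = "a\u00e1b"  # the string "aáb" from the report

# Python 3 str patterns are Unicode-aware by default, so \w matches "á".
unicode_words = re.findall(r"\w", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], similar to Python 2 without (?u).
ascii_words = re.findall(r"\w", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

So under Python 3 the question becomes which flag to pass explicitly, rather than whether the pattern literal carries a u prefix.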

regards,
Vlastimil Brom

atomic and normal groups in recursive patterns

Original report by Anonymous.

First, many thanks for adding support for recursive patterns to regex!

While playing with this feature, mainly in the context of balancing substrings, I found a discrepancy I can't understand. (I suspect this issue ought to be of the type "question" rather than a report, but anyway.)

I am trying a pattern for possibly multiply parenthesised substrings; a version with an atomic group works, as I would expect:

>>> regex.findall(r"\((?:(?>[^()]+)|(?R))*\)", "a(bcd(e)f)g(h)")
['(bcd(e)f)', '(h)']

However, using a normal group the result is:

>>> regex.findall(r"\((?:([^()]+)|(?R))*\)", "a(bcd(e)f)g(h)")
['f', 'h']
>>>

I confess, I am not fully able to trace the (not-)backtracking behaviour in detail, but I can't understand, how there can be a match without the outer "(" ")", which appear to be an unconditional part of the pattern. Or is possibly something else than whole matches returned from findall()?

Regarding the above examples - I am trying to find a pattern for nested elements - parentheses in this simplified example - which would work for balanced "subexpressions" in both directions, so that the longest balanced part would be found.
e.g.:

>>> regex.findall(r"\((?:(?>[^()]+)|(?R))*\)", "a(b(cd)e)f)g)h")
['(b(cd)e)']

works OK, the superfluous closing parentheses are ignored;
however:

>>> regex.findall(r"\((?:(?>[^()]+)|(?R))*\)", "a(bc(d(e)f)gh")
[]

Is the right way to ignore the superfluous opening parentheses to use a reversed search?

>>> regex.findall(r"(?r)\((?:(?>[^()]+)|(?R))*\)", "a(bc(d(e)f)gh")
['(d(e)f)']
>>>

Or is there maybe a pattern, which would match in both cases unchanged?

(The next step would be to find the *not*-balanced elements, if it would be at all possible only using regex.)
(the above tests on Win XP, Python 2.7.2, regex-0.1.20120105)
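
As a point of comparison for the "both directions" question, here is a rough non-regex sketch that finds the outermost balanced spans with a stack. The function name balanced_spans is made up for this illustration; it is not part of regex or re, and it does not answer whether a single recursive pattern can do the same:

```python
def balanced_spans(text, open_ch="(", close_ch=")"):
    """Return the outermost balanced substrings of text, left to right."""
    open_positions = []  # indices of openers still waiting for a closer
    spans = []           # (start, end) pairs of outermost spans found so far
    for index, char in enumerate(text):
        if char == open_ch:
            open_positions.append(index)
        elif char == close_ch and open_positions:
            start = open_positions.pop()
            # Spans that start after `start` are nested inside this one.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, index))
        # A closer with no pending opener is superfluous and is skipped.
    # Openers left on the stack at the end are superfluous as well.
    return [text[start:end + 1] for start, end in spans]


print(balanced_spans("a(bcd(e)f)g(h)"))
print(balanced_spans("a(b(cd)e)f)g)h"))
print(balanced_spans("a(bc(d(e)f)gh"))
```

Because unmatched openers simply stay on the stack and unmatched closers are skipped, the same code handles both of the unbalanced inputs above without reversing the scan direction.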

Thanks and regards,
vbr

regex.search("(a(?(1)\\1)){4}", "a"*10, flags=regex.V1).group(0,1) returns ('aaaaa', 'a') instead of ('aaaaaaaaaa', 'aaaa')

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> print(regex.search("(a(?(1)\\1)){4}", "a"*10, flags=regex.V1).group(0,1))
('aaaaa', 'a')
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("(a(?(1)\\1)){4}", "a"*10, flags=regex.V1).group(0,1))
('aaaaaaaaaa', 'aaaa')
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120123

Please provide any additional information below.

The following works fine:

>>> print(regex.search("(a(?(1)\\1)){1}", "a"*10, flags=regex.V1).group(0,1))
('a', 'a')
>>> print(regex.search("(a(?(1)\\1)){2}", "a"*10, flags=regex.V1).group(0,1))
('aaa', 'aa')
>>>

negated unicode properties in case-insensitive mode

Original report by Anonymous.

While trying to test some of the recently listed properties supported by regex, it appears to me, that the negated properties don't work in case insensitive search; cf.:

>>> regex.findall(ur"(?i)\P{InBasicLatin}",u"aáb")
[u'a', u'b']
>>> regex.findall(ur"(?i)\p{InBasicLatin}",u"aáb")
[u'a', u'b']
>>> 
>>> regex.findall(ur"\P{InBasicLatin}",u"aáb")
[u'\xe1']
>>> regex.findall(ur"\p{InBasicLatin}",u"aáb")
[u'a', u'b']
>>>

as if the negated property escape \P were somehow treated as lowercase (?)

some other literals don't seem to be affected, e.g.

>>> regex.findall(ur"\s",u"a b\tcd")
[u' ', u'\t']
>>> regex.findall(ur"\S",u"a b\tcd")
[u'a', u'b', u'c', u'd']
>>> 
>>> regex.findall(ur"(?i)\s",u"a b\tcd")
[u' ', u'\t']
>>> regex.findall(ur"(?i)\S",u"a b\tcd")
[u'a', u'b', u'c', u'd']
>>>

works as expected.

Regards,
vbr

locale flag behaviour - independent of locale.setlocale()

Original report by Anonymous.

Hi,

I hope, I am not missing anything trivial, I just noticed a behaviour of the LOCALE flag I can't understand; however, both re and regex behave the same in this respect:

I thought, the search pattern (?L)\w would match any of the respective string.letters according to the current locale (and possibly additionally [0-9_]).

However, the locale doesn't seem to be reflected.

>>> unicode_BMP = " " + "".join(unichr(i)for i in range(1, 0xFFFF+1))

>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print unicode(string.letters, "windows-1250")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzŠŚŤŽŹšśťžźŁĄŞŻłµąşĽľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print unicode(string.letters, "windows-1253")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒΆµΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
>>> 
>>> import re
>>> import regex
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print "".join(re.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz����������£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> print "".join(regex.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz����������£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> 

>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print "".join(re.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz�¢²³µ¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
>>> print "".join(regex.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz�¢²³µ¸¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþ
>>>

it seems that the nearest letter set to the result of the re/regex LOCALE flags might be ASCII or the US locale:

>>> locale.setlocale(locale.LC_ALL, "US")
'English_United States.1252'
>>> print unicode(string.letters, "windows-1252")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
>>>

however, there are some differences too, namely between z� and À
re/regex (?L)\w :
z�¢²³µ¸¹º¼¾¿À (as displayed in wxPython shell)
z��£¥ª¯³µ¹º¼¾¿À (as displayed in tkinter Idle shell)
(in either case, there are some items, one wouldn't consider usual word characters, cf. ¿)
US string.letters:
zƒŠŒŽšœžŸªµºÀ (displayed identically in both shells)

There are likely some other issues (like some encoding/displaying peculiarities in wx and Tkinter), but the regex matching using the LOCALE flag clearly doesn't reflect the locale.setlocale(...) setting.

Is it supposed to work this way and is there another possibility to get the expected locale aware matching?

using Python 2.7.1, 32 bit; win 7 Home Premium 64-bit, Czech; regex-0.1.20110315.

Regards,
Vlastimil Brom

set qualifiers - feature idea

Original report by Anonymous.

Some background: I've been working with very large REs in CPython and IronPython. We generate the RE pattern from lists, like lists of cities or lists of names, somewhat like this:

namelist = open("names.txt").read().split()
pattern = re.compile("|".join(namelist))

The one I'm working with now is just a pattern for finding substrings that look like the name of a person. It's overflowing the System::Text::RegularExpressions buffers on IronPython, but works OK with CPython 2.6 on 64-bit Ubuntu.

One of the things I've been thinking is that this kind of pattern should be handled differently. Suppose there was some syntax like

pattern = re.compile("(?S<names>)", names=ImmutableSet(namelist))

where (?S indicates a named ImmutableSet whose members are drawn from the keyword argument of that name. The compiler would generate a reasonably fast pattern from that set, say the union of all characters in all the strings in the set, plus a minimum and maximum size based on the shortest and longest elements of the set. When the engine runs, it would match that fast pattern and, if it matches, check whether the matched group is a member of the named set. If so, the match would be confirmed; if not, it would fail.

Seems like this might be a useful feature for regex to have, given the popularity of this kind of machine-generated RE.
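
The idea can be roughed out with the standard re module alone. This is only a sketch of the proposal, not how regex implements anything: compile_set_matcher is a hypothetical helper, the word list is assumed to be non-empty, and the step of trying shorter prefixes of a candidate is an extra assumption not spelled out above:

```python
import re


def compile_set_matcher(words):
    """Build a finder that prefilters cheaply, then confirms set membership."""
    members = frozenset(words)
    # Union of all characters used by any member, escaped for use in a set.
    alphabet = "".join(re.escape(ch) for ch in sorted(set("".join(members))))
    shortest = min(map(len, members))
    longest = max(map(len, members))
    # Fast pattern: a run of allowed characters of a plausible length.
    prefilter = re.compile("[%s]{%d,%d}" % (alphabet, shortest, longest))

    def findall(text):
        found = []
        pos = 0
        while pos < len(text):
            candidate = prefilter.match(text, pos)
            hit = None
            if candidate:
                run = candidate.group()
                # Confirm against the set, trying longer candidates first.
                for length in range(len(run), shortest - 1, -1):
                    if run[:length] in members:
                        hit = run[:length]
                        break
            if hit:
                found.append(hit)
                pos += len(hit)
            else:
                pos += 1
        return found

    return findall


find_names = compile_set_matcher(["Ann", "Bob", "Anna"])
print(find_names("Anna met Bob and Annabel"))
```

The compiled pattern stays tiny no matter how long the name list is; the cost moves into the hash lookups, which is the trade-off the proposal is after.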

regex.compile("a(?#xxx)*") causes "_regex_core.error: nothing to repeat"

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> regex.compile("a(?#xxx)*")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_
pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 355, in parse_
sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 365, in parse_
item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 698, in parse_
element
    raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>

What is the expected output? What do you see instead?

>>> regex.compile("a(?#xxx)*")
regex.Regex('a(?#xxx)*', flags=regex.V0)
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120119

Please provide any additional information below.

HG history makes no sense

Original report by Anonymous.

What steps will reproduce the problem?

hg clone
hg log
say aloud "whut"

As far as I can tell there are two parallel histories on the "default" branch. Google Code's commit browser shows only one of these histories. hg log shows both histories interleaved, one commit after the other, producing a single nonsensical history. hgview shows these histories in parallel, but still asserts that both are on the same branch. Is there a problem with the repository? Am I doing something dumb?

I'd like to know with certainty the latest version of this project and the history that led to it.

= for fuzzy matches

Original report by Anonymous.

An = operator could be pretty handy for fuzzy matches, for finding only erroneous text. For example, in a list of hotmail email accounts, you could search for misspellings with '@(hotmail.com){e=1}'. This would save the user an extra "grep -v" for filtering correct emails out of the list of matches.
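
To make the requested semantics concrete, here is a plain-Python sketch of "exactly one error", using a textbook Levenshtein distance instead of any regex syntax. The helper names and the sample addresses are invented for the illustration:

```python
def edit_distance(a, b):
    """Levenshtein distance: insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def misspelled_domains(addresses, expected="hotmail.com"):
    """Keep addresses whose domain is exactly one edit away from expected."""
    result = []
    for address in addresses:
        domain = address.rpartition("@")[2]
        if edit_distance(domain, expected) == 1:
            result.append(address)
    return result


addresses = [
    "a@hotmail.com",   # correct, distance 0
    "b@hotmial.com",   # swapped letters count as two substitutions
    "c@hotmal.com",    # one deletion
    "d@hotmaill.com",  # one insertion
    "e@gmail.com",     # a different domain
]
print(misspelled_domains(addresses))
```

Note that under this cost model a transposition such as "hotmial" counts as two errors, so it would need e=2 rather than e=1.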

regex.compile("^(?:a(?:(?:))+)+") causes "_regex_core.error: nothing to repeat"

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> regex.compile("^(?:a(?:(?:))+)+")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_
pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 355, in parse_
sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 365, in parse_
item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 649, in parse_
element
    element = parse_paren(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 783, in parse_
paren
    return parse_flags_subpattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 998, in parse_
flags_subpattern
    subpattern = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_
pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 355, in parse_
sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 365, in parse_
item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 698, in parse_
element
    raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> regex.compile("^(?:a(?:(?:))+)+")
regex.Regex('^(?:a(?:(?:))+)+', flags=regex.V0)
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120119

Please provide any additional information below.

regex.search("^(a|)\\1{2}b", "b") returns None

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> print(regex.search("^(a|)\\1{2}b", "b"))
None

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("^(a|)\\1{2}b", "b"))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("^(a|)\\1{2}b", "b").group(0,1))
('b', '')
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120112

Please provide any additional information below.

regex 0.1.20120103 has this issue too

regex.search("(a*)*", "a", flags=regex.V1).span(1) returns (0, 1) instead of (1, 1)

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> print(regex.search("(a*)*", "a", flags=regex.V1).span(1))
(0, 1)
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("(a*)*", "a", flags=regex.V1).span(1))
(1, 1)
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120119

Please provide any additional information below.

The following works fine:

>>> import regex
>>> print(regex.search("(a*)*", "aa").span(1))
(2, 2)
>>> print(regex.search("(a*)*", "aaa").span(1))
(3, 3)

regex.compile("(?=abc){3}abc") causes "_regex_core.error: nothing to repeat"

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> regex.compile("(?=abc){3}abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_
pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 352, in parse_
sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 370, in parse_
item
    raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> regex.compile("(?=abc){3}abc")
regex.Regex('(?=abc){3}abc', flags=regex.V0)
>>> regex.search("(?=abc){3}abc", "abcabcabc").span(0)
(0, 3)
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120119

Please provide any additional information below.

regex.search("(?>.*/)b", "a/b") returns None

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> print(regex.search("(?>.*/)b", "a/b"))
None
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("(?>.*/)b", "a/b"))
<_regex.Match object at 0x00C0BBB8>
>>> print(regex.search("(?>.*/)b", "a/b").group(0))
a/b
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120114

Please provide any additional information below.

"bad set" error for unescaped ] at the beginning of the set

Original report by Anonymous.

Hi,

I just found an inconsistency between regex and re in the handling of sets (it might depend on the newest addition of set operations).

I thought, a pattern like "[][]" would be legal (although probably not very readable). It also does work in re, but in regex it causes a "bad set" error:

>>> print re.sub(r"([][])", r"-", u"a[b]c")
a-b-c
>>> print regex.sub(r"([][])", r"-", u"a[b]c")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 194, in sub
  File "regex.pyc", line 334, in _compile
  File "_regex_core.pyc", line 243, in _parse_pattern
  File "_regex_core.pyc", line 257, in _parse_sequence
  File "_regex_core.pyc", line 270, in _parse_item
  File "_regex_core.pyc", line 369, in _parse_element
  File "_regex_core.pyc", line 503, in _parse_paren
  File "_regex_core.pyc", line 243, in _parse_pattern
  File "_regex_core.pyc", line 257, in _parse_sequence
  File "_regex_core.pyc", line 270, in _parse_item
  File "_regex_core.pyc", line 382, in _parse_element
  File "_regex_core.pyc", line 924, in _parse_set
  File "_regex_core.pyc", line 933, in _parse_set_union
  File "_regex_core.pyc", line 943, in _parse_set_symm_diff
  File "_regex_core.pyc", line 950, in _parse_set_inter
  File "_regex_core.pyc", line 957, in _parse_set_diff
  File "_regex_core.pyc", line 971, in _parse_set_imp_union
  File "_regex_core.pyc", line 978, in _parse_set_member
  File "_regex_core.pyc", line 1046, in _parse_set_item
  File "_regex_core.pyc", line 933, in _parse_set_union
  File "_regex_core.pyc", line 943, in _parse_set_symm_diff
  File "_regex_core.pyc", line 950, in _parse_set_inter
  File "_regex_core.pyc", line 957, in _parse_set_diff
  File "_regex_core.pyc", line 971, in _parse_set_imp_union
  File "_regex_core.pyc", line 978, in _parse_set_member
  File "_regex_core.pyc", line 1048, in _parse_set_item
error: bad set

It can be easily remedied (after I found the problem in a more complex pattern) by escaping the square brackets:

>>> print re.sub(r"([\]\[])", r"-", u"a[b]c")
a-b-c
>>> print regex.sub(r"([\]\[])", r"-", u"a[b]c")
a-b-c
>>>

Using regex-0.1.20110510 python 2.7.1, Win XP

regards,
vbr

Problem with shared iterators

Original report by Anonymous.

Issue 1366311 talks about releasing the GIL in the "re" module, but points to a problem which could occur with scanner objects which are shared across threads.

Unfortunately this module has just that problem. :-(

I'm currently attempting to fix it, but it's proving to be surprisingly difficult. When the iterator terminates in a thread, it's being collected, which causes a failure later when another thread tries to use it.

Revision 4dc5f7f181 didn't fix it.

casefolding specification

Original report by Anonymous.

First, thanks for the new release (regex 0.1.20110917) ! (I especially like the changed fuzzy matching behaviour as discussed in
https://code.google.com/p/mrab-regex-hg/issues/detail?id=12#c28 )

I'd like to ask about the specification of the case-folding behaviour used in case insensitive matching.

Is it chapter 5.18 of the Unicode standard

http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf

Or did I miss something else?

I tried some patterns, where I thought these would be "caselessly" equivalent (based on the above)

>>> for m in regex.findall(ur"(?V1i)[ΣΟ]",u"ρς στΣόο"): print m,
... 
σ Σ ο

Here I'd have thought, the accented lowercase omicron or the positional lower sigma variant would be matched too.

On the other hand, the sharp s (which is more frequent in my texts) seems to be matched in all directions.

>>> for m in regex.findall(ur"(?V1i)ẞ",u"-s-S-ss-SS-ß-ẞ-"): print m,
... 
ss SS ß ẞ
>>> for m in regex.findall(ur"(?V1i)ss",u"-s-S-ss-SS-ß-ẞ-"): print m,
... 
ss SS ß ẞ
>>>

I thought that only the changes in case should be reflected in matching; now there is effectively an equivalence between lowercase ss and ß, which is not (at least not always) what is expected. (Both with respect to the current German orthography and for dealing with text preceding that official orthography regulation.)

Is there now some way to handle these characters as distinct (other than not using the i flag)?

Where can I maybe find the specification for this behaviour? - it seems, that I will need to reflect it in the search patterns.
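
For reference, a small sketch of what Unicode full case folding does to the characters discussed here, using Python 3's str.casefold(). Whether regex uses exactly the same folding tables is an assumption; the snippet only shows the standard-library behaviour:

```python
# Characters from the examples above, written as escapes to avoid
# any dependence on the source encoding.
samples = {
    "sharp s": "\u00df",             # ß
    "capital sharp s": "\u1e9e",     # ẞ
    "final sigma": "\u03c2",         # ς
    "capital sigma": "\u03a3",       # Σ
    "omicron with tonos": "\u03cc",  # ό
}

# str.casefold() applies full case folding, which may change the length.
folded = {name: ch.casefold() for name, ch in samples.items()}

for name, value in folded.items():
    print(name, ascii(value))
```

Full folding maps both ß and ẞ to "ss" and maps final sigma to ordinary sigma, but it does not strip accents, so ό stays distinct from ο under case folding alone.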

(I can't comment competently on the behaviour of the "prominent" case of the Turkic "i"s; personally I believe, there must be other comparable cases, once we begin to care about them... I'd support the view of some contributors in the respective py-list thread ( http://mail.python.org/pipermail/python-list/2011-September/1280544.html ), that such cases are better dealt with individually, on an application basis, if it need be. (I'd just prefer keeping the flags repertoire shorter, if possible:-)

Regards,
Vlastimil Brom

module alias for regex_version1

Original report by Anonymous.

Now that other features like nested sets have been moved under VERSION1 for compatibility with re, I am wondering how to make the regex behaviour in my scripts default to the new VERSION1 (if possible without adding the flag all over the code).
It has been already explained, that it can't be a module-wide setting, as this would influence other programs or libraries using it.

Before I try some hackish approach, I wanted to ask, whether some kind of "aliasing" the module on import could work.
I mentioned this idea marginally in
http://bugs.python.org/issue2636#msg143442 but the discussion was concentrated on other topics.

Would it be possible to have something like

import regex_version0 as re

vs:

import regex_version1 as re

while the opposed flags could be still set if needed?

Could this be achieved (without much code duplication), or are there some drawbacks or limits?
Or are there other approaches how to activate the version1 behaviour in the user code "at once"?
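
One user-side approach that avoids any module-wide setting is a thin wrapper that ORs a default flag into every call. The sketch below is hypothetical: with_default_flags is not part of regex or re, and the demo uses the standard re module with re.IGNORECASE standing in for regex.VERSION1:

```python
import functools
import re
import types


def with_default_flags(module, default_flags,
                       names=("compile", "search", "match",
                              "findall", "sub", "split")):
    """Return a namespace whose functions OR default_flags into `flags`."""
    def wrap(func):
        @functools.wraps(func)
        def wrapper(*args, flags=0, **kwargs):
            # Caller-supplied flags are combined with the default, not replaced.
            return func(*args, flags=flags | default_flags, **kwargs)
        return wrapper

    namespace = types.SimpleNamespace()
    for name in names:
        setattr(namespace, name, wrap(getattr(module, name)))
    return namespace


# Stand-in demo: IGNORECASE plays the role that VERSION1 would play.
re_ci = with_default_flags(re, re.IGNORECASE)
print(re_ci.findall(r"ab", "AB ab Ab"))
print(re_ci.sub(r"x", "-", "aXbxc"))
```

The wrapper only works when flags is passed by keyword; a positional flags argument would collide with the injected one. It also cannot switch the default off again, since VERSION0 and VERSION1 are separate flags rather than one toggle.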

Regards,
vbr

Forward references; nested references?

Original report by Anonymous.

I'd like to ask about the support for forward references and nested references in regex ( http://www.regular-expressions.info/brackets.html ).

I couldn't find any notice of this in the documentation, but it seems, that forward references are supported, while nested references are not:

>>> regex.search(r"(\2b|(a))+", "-aab-").group()
'aab'
>>> 
>>> regex.search(r"(\1b|(a))+", "-aab-").group()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 235, in search
    return _compile(pattern, flags, kwargs).search(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python27\lib\_regex_core.py", line 334, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 350, in parse_sequence
    item = parse_item(source, info)
  File "C:\Python27\lib\_regex_core.py", line 363, in parse_item
    element = parse_element(source, info)
  File "C:\Python27\lib\_regex_core.py", line 587, in parse_element
    element = parse_paren(source, info)
  File "C:\Python27\lib\_regex_core.py", line 723, in parse_paren
    subpattern = parse_pattern(source, info)
  File "C:\Python27\lib\_regex_core.py", line 334, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 350, in parse_sequence
    item = parse_item(source, info)
  File "C:\Python27\lib\_regex_core.py", line 363, in parse_item
    element = parse_element(source, info)
  File "C:\Python27\lib\_regex_core.py", line 584, in parse_element
    return parse_escape(source, info, False)
  File "C:\Python27\lib\_regex_core.py", line 1035, in parse_escape
    return parse_numeric_escape(source, info, ch, in_set)
  File "C:\Python27\lib\_regex_core.py", line 1069, in parse_numeric_escape
    raise error("can't refer to an open group")
error: can't refer to an open group
>>>

Is it true, or am I misinterpreting something?

(re fails with the same error message for the second pattern and "bogus escape: '\\2'" for the first one.)

thanks and regards,
vbr

regex.search("(\\()?[^()]+(?(1)\\)|)", "(abcd").group(0) returns "bcd" instead of "abcd"

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> print(regex.search("(\\()?[^()]+(?(1)\\)|)", "(abcd").group(0))
bcd
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("(\\()?[^()]+(?(1)\\)|)", "(abcd").group(0))
abcd
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120114

Please provide any additional information below.

regex.search("(a)(?<=b(?1))", "baz", regex.V1) returns None incorrectly

Original report by Masami HIRATA (Bitbucket: msmhrt, GitHub: msmhrt).

What steps will reproduce the problem?

>>> import regex
>>> print(regex.search("(a)(?<=b(?1))", "baz", regex.V1))
None
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("(a)(?<=b(?1))", "baz", regex.V1))
<_regex.Match object at 0x00C09C28>
>>> print(regex.search("(a)(?<=b(?1))", "baz", regex.V1).group(0))
a
>>>

What version of the product are you using? On what operating system?

Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120123

Please provide any additional information below.

nested sets in case insensitive mode ("invalid RE code")

Original report by Anonymous.

I just encountered an error while using nested sets in case-insensitive mode:

>>> regex.findall("(?V1)[[a-z]--[aei]]","abc")
['b', 'c']
>>> regex.findall("(?V1i)[[a-z]--[aei]]","abc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 276, in findall
    return _compile(pattern, flags, kwargs).findall(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 499, in _compile
    named_lists, named_list_indexes)
RuntimeError: invalid RE code
>>> regex.findall("(?V0i)[[a-z]--[aei]]","abc")
[]
>>>

Using regex-0.1.20111004; Python 2.7.2 on WinXP. I am not sure when it first appeared, but it probably hasn't been there for long.

regards,
vbr
