Regex

Open source regex engine.

Warning. Not meant to be used in production, created for learning purposes!
See Let's Build a Regex Engine series to learn how this project came to be.

Usage

Create a Regex object by providing a pattern and an optional set of options (Regex.Options):

let regex = try Regex(#"<\/?[\w\s]*>|<.+[\W]>"#)

The pattern is parsed and compiled to an internal representation. If the pattern contains an error, the initializer throws a detailed error containing the index of the failing token and an error message.

Use the isMatch(_:) method to check whether the regular expression pattern occurs in the input text:

regex.isMatch("<h1>Title</h1>")

Retrieve all occurrences of text that match the regular expression by calling the matches(in:) method. Each match contains a range in the input string.

for match in regex.matches(in: "<h1>Title</h1>\n<p>Text</p>") {
    print(match.value)
}
// Prints "<h1>", "</h1>", "<p>", and "</p>", one per line

If you just want a single match, use regex.firstMatch(in:).

Regex is fully thread-safe.

Features

Character Classes

A character class matches any one of a set of characters.

  • [character_group] – matches any single character in character_group, e.g. [ae]
  • [^character_group] – negation, matches any single character that is not in character_group, e.g. [^ae]
  • [first-last] – character range, matches any single character in the given range from first to last, e.g. [a-z]
  • . – wildcard, matches any single character except \n
  • \w – matches any word character (negation: \W)
  • \s – matches any whitespace character (negation: \S)
  • \d – matches any decimal digit (negation: \D)
  • \z – matches the end of the string; this is an anchor rather than a character class, and \Z is a related anchor, not its negation (see "Anchors")
  • \p{name} – matches characters from the given Unicode category, e.g. \p{P} for punctuation characters (supported categories: P, Lt, Ll, N, S) (negation: \P)
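
The character classes above can be tried out with a small script. This is a minimal sketch that uses Foundation's NSRegularExpression as a stand-in so that it runs without this package; the same patterns passed to Regex(_:) are expected to behave alike, but that is an assumption and is not verified here. The allMatches helper is hypothetical and exists only for illustration.

```swift
import Foundation

// Hypothetical helper (not part of this library): returns every substring
// of `text` matched by `pattern`, using NSRegularExpression as a stand-in.
func allMatches(of pattern: String, in text: String,
                options: NSRegularExpression.Options = []) throws -> [String] {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: range).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

print(try allMatches(of: "[ae]", in: "sea"))                    // ["e", "a"]
print(try allMatches(of: "[^ae]", in: "sea"))                   // ["s"]
print(try allMatches(of: "[a-c]+", in: "abcdef cab"))           // ["abc", "cab"]
print(try allMatches(of: #"\d+"#, in: "Room 101, floor 7"))     // ["101", "7"]
print(try allMatches(of: #"\p{P}"#, in: "Hi, there!"))          // [",", "!"]
```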

Characters consisting of multiple Unicode scalars (extended grapheme clusters) are interpreted as single characters, e.g. the pattern "🇺🇸+" matches "🇺🇸" and "🇺🇸🇺🇸" but not "🇸🇸". Inside a character group, however, each Unicode scalar is interpreted separately, e.g. the pattern "[🇺🇸]" matches both "🇺🇸" and "🇸🇸", because "🇸🇸" consists only of scalars that also occur in "🇺🇸".
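
The difference between a character and its scalars can be seen with plain Swift strings, without any regex engine. The sketch below only illustrates the Unicode structure of the two flags mentioned above; it does not exercise this library.

```swift
// Each flag is one Character (extended grapheme cluster) built from two
// regional-indicator scalars.
let us = "\u{1F1FA}\u{1F1F8}"   // 🇺🇸 = U+1F1FA (🇺) followed by U+1F1F8 (🇸)
let ss = "\u{1F1F8}\u{1F1F8}"   // 🇸🇸 = U+1F1F8 (🇸) twice

print(us.count)                 // 1 (one character)
print(us.unicodeScalars.count)  // 2 (two scalars)
print((us + us).count)          // 2 (two characters)

// Every scalar of 🇸🇸 also occurs in 🇺🇸, which is why a scalar-based
// character group such as [🇺🇸] can accept it.
let usScalars = Set(us.unicodeScalars)
let ssUsesOnlyUSScalars = ss.unicodeScalars.allSatisfy { usScalars.contains($0) }
print(ssUsesOnlyUSScalars)      // true
print(us == ss)                 // false (different characters)
```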

Character Escapes

The backslash (\) either indicates that the character that follows is a special character or that the keyword should be interpreted literally.

  • \keyword – interprets the keyword literally, e.g. \{ matches the opening bracket
  • \special_character – interprets the special character, e.g. \b matches word boundary (more info in "Anchors")
  • \u{nnnn} – matches a UTF-16 code unit, e.g. \u{0020} matches a space (Swift-specific feature)
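
Because the backslash is also Swift's own escape character, it matters how a pattern is written in source code. The sketch below uses only the Swift standard library and shows why the usage example in this README uses a raw string literal (#"..."#): in a raw literal the backslashes reach the regex engine untouched. How the engine then interprets each escape is described by the list above and is not tested here.

```swift
// The same pattern written as a regular literal and as a raw literal.
let escaped = "<\\/?[\\w\\s]*>"     // every backslash must be doubled
let raw = #"<\/?[\w\s]*>"#          // backslashes are passed through as-is
print(escaped == raw)               // true

// In a regular literal Swift itself resolves \u{0020} to a space character.
let space = "\u{0020}"
print(space == " ")                 // true

// In a raw literal the eight characters \u{0020} are kept verbatim and
// would be handed to the regex engine for interpretation.
let rawEscape = #"\u{0020}"#
print(rawEscape.count)              // 8
```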

Anchors

Anchors specify a position in the string where a match must occur.

  • ^ – matches the beginning of the string (or beginning of the line when .multiline option is enabled)
  • $ – matches the end of the string or \n at the end of the string (end of the line in .multiline mode)
  • \A – matches the beginning of the string (ignores .multiline option)
  • \Z – matches the end of the string or \n at the end of the string (ignores .multiline option)
  • \z – matches the end of the string (ignores .multiline option)
  • \G – match must occur at the point where the previous match ended
  • \b – match must occur on a boundary between a word character and a non-word character (negation: \B)
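
The difference between $, \Z, and \z is easiest to see on input that ends with a newline. This is a minimal sketch that uses Foundation's NSRegularExpression as a stand-in so that it runs without this package; the anchors are described identically above, but equal behavior in Regex is an assumption. The allMatches helper is hypothetical.

```swift
import Foundation

// Hypothetical helper (not part of this library): returns every substring
// of `text` matched by `pattern`, using NSRegularExpression as a stand-in.
func allMatches(of pattern: String, in text: String,
                options: NSRegularExpression.Options = []) throws -> [String] {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: range).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

// \b only matches "cat" when it stands alone as a word.
print(try allMatches(of: #"\bcat\b"#, in: "cat concat cats cat."))  // ["cat", "cat"]

// The input ends with \n, so only \z (absolute end of string) fails.
let line = "the end\n"
print(try allMatches(of: #"end$"#, in: line))    // ["end"]
print(try allMatches(of: #"end\Z"#, in: line))   // ["end"]
print(try allMatches(of: #"end\z"#, in: line))   // []
```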

Grouping Constructs

Grouping constructs delineate the subexpressions of a regular expression and capture the substrings of an input string.

  • (subexpression) – captures a subexpression in a group
  • (?:subexpression) – non-capturing group
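
To illustrate how the two group kinds differ, the sketch below lists the groups of the first match for a pattern that mixes them. It uses Foundation's NSRegularExpression as a stand-in so that it runs without this package; how Regex exposes captured groups on its own match type is not shown in this README and is not assumed here. The firstMatchGroups helper is hypothetical.

```swift
import Foundation

// Hypothetical helper (not part of this library): returns the whole match
// (index 0) followed by each captured group of the first match.
func firstMatchGroups(of pattern: String, in text: String) throws -> [String] {
    let regex = try NSRegularExpression(pattern: pattern, options: [])
    let range = NSRange(text.startIndex..., in: text)
    guard let match = regex.firstMatch(in: text, options: [], range: range) else {
        return []
    }
    return (0..<match.numberOfRanges).compactMap { index in
        Range(match.range(at: index), in: text).map { String(text[$0]) }
    }
}

// The month is wrapped in (?:...), so it is matched but not captured.
let groups = try firstMatchGroups(of: #"(\d{4})-(?:\d{2})-(\d{2})"#, in: "2024-05-17")
print(groups)   // ["2024-05-17", "2024", "17"]
```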

Backreferences

Backreferences provide a convenient way to identify a repeated character or substring within a string.

  • \number – matches the content of the capture group at the given ordinal position, e.g. \4 matches the content of the fourth group

If the referenced group can't be found in the pattern, an error is thrown.
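
A typical use of a backreference is finding doubled letters. This is a minimal sketch that uses Foundation's NSRegularExpression as a stand-in so that it runs without this package; the same pattern passed to Regex(_:) is expected to behave alike, but that is an assumption. The allMatches helper is hypothetical.

```swift
import Foundation

// Hypothetical helper (not part of this library): returns every substring
// of `text` matched by `pattern`, using NSRegularExpression as a stand-in.
func allMatches(of pattern: String, in text: String,
                options: NSRegularExpression.Options = []) throws -> [String] {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: range).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

// (\w) captures one word character; \1 requires the same character again.
let doubled = try allMatches(of: #"(\w)\1"#, in: "hello bookkeeper")
print(doubled)   // ["ll", "oo", "kk", "ee"]
```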

Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found.

  • * – match zero or more times
  • + – match one or more times
  • ? – match zero or one time
  • {n} – match exactly n times
  • {n,} – match at least n times
  • {n,m} – match from n to m times, closed range, e.g. a{3,4}

All quantifiers are greedy by default: they try to match as many occurrences of the pattern as possible. Append the ? character to a quantifier to make it lazy and match as few occurrences as possible, e.g. a+?.

Warning: lazy quantifiers can be used to control which groups and matches are captured, but they shouldn't be used to optimize matcher performance. The matcher already uses an algorithm that can handle even nested greedy quantifiers.
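
The effect of greedy and lazy quantifiers on what gets matched can be sketched as follows. The example uses Foundation's NSRegularExpression as a stand-in so that it runs without this package; the same patterns passed to Regex(_:) are expected to yield the same matches, but that is an assumption. The allMatches helper is hypothetical.

```swift
import Foundation

// Hypothetical helper (not part of this library): returns every substring
// of `text` matched by `pattern`, using NSRegularExpression as a stand-in.
func allMatches(of pattern: String, in text: String,
                options: NSRegularExpression.Options = []) throws -> [String] {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: range).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

let html = "<h1>Title</h1>"
// Greedy: .+ runs to the last ">" in the input.
print(try allMatches(of: "<.+>", in: html))    // ["<h1>Title</h1>"]
// Lazy: .+? stops at the first ">" it can reach.
print(try allMatches(of: "<.+?>", in: html))   // ["<h1>", "</h1>"]

// Bounded repetition: between 3 and 4 occurrences, as many as possible.
print(try allMatches(of: "a{3,4}", in: "aa aaa aaaaa"))   // ["aaa", "aaaa"]
// Lazy one-or-more: each match is as short as possible.
print(try allMatches(of: "a+?", in: "aaa"))               // ["a", "a", "a"]
```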

Alternation

  • | – match either left side or right side

Options

Regex can be initialized with a set of options (Regex.Options).

  • .caseInsensitive – match letters in the pattern independent of case.
  • .multiline – controls the behavior of the ^ and $ anchors. By default, these match at the start and end of the input text. If this flag is set, they match at the start and end of each line within the input text.
  • .dotMatchesLineSeparators – allow . to match any character, including line separators.
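
The effect of each option can be sketched with Foundation's NSRegularExpression, used here as a stand-in so that the example runs without this package. Its option set has .caseInsensitive and .dotMatchesLineSeparators under the same names, and its .anchorsMatchLines plays the role described above for .multiline. That Regex.Options behaves identically is an assumption; the allMatches helper is hypothetical.

```swift
import Foundation

// Hypothetical helper (not part of this library): returns every substring
// of `text` matched by `pattern`, using NSRegularExpression as a stand-in.
func allMatches(of pattern: String, in text: String,
                options: NSRegularExpression.Options = []) throws -> [String] {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: range).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

// Case-insensitive matching.
print(try allMatches(of: "hello", in: "Hello HELLO", options: [.caseInsensitive]))
// ["Hello", "HELLO"]

// ^ matches only at the start of the input unless line anchoring is enabled.
let lines = "one\ntwo\nthree"
print(try allMatches(of: #"^\w+"#, in: lines))                                 // ["one"]
print(try allMatches(of: #"^\w+"#, in: lines, options: [.anchorsMatchLines]))  // ["one", "two", "three"]

// . skips \n unless it is allowed to match line separators.
print(try allMatches(of: "a.b", in: "a\nb").isEmpty)                                  // true
print(try allMatches(of: "a.b", in: "a\nb", options: [.dotMatchesLineSeparators]))   // ["a\nb"]
```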

Unsupported Features

  • Most Unicode categories are not supported, e.g. \p{Sc} (currency symbols)
  • Character class subtraction, e.g. [a-z-[b-f]]
  • Named blocks, e.g. \p{IsGreek}

Grammar

See Grammar.ebnf for a formal description of the language in EBNF notation. See Grammar.xhtml for a visualization (railroad diagram) of the grammar, generated with https://www.bottlecaps.de/rr/ui.

License

Regex is available under the MIT license. See the LICENSE file for more info.
