Regular Expression

Use Cases

  • Find whole word: \bword\b

Character classes

  • . matches any character except a line terminator unless the DOTALL flag is activated.

Modes

Flags

  • Java
    • Pattern.DOTALL: (?s)
    • Pattern.CASE_INSENSITIVE: (?i)
    • Pattern.MULTILINE: (?m)

Usage

  • Java

    • Multiple modes:

      Pattern.DOTALL | Pattern.MULTILINE
    • Or can be enabled via the embedded flag expression at the start of the pattern string:

      Pattern.compile("(?s)aa(.*)bb", Pattern.CASE_INSENSITIVE );

Boundary matchers

Capturing group and non-capturing group

  • Group 0 usually means the whole expression.
PatternDescription
(?<name>X)X, as a named-capturing group
(?:X)X, as a non-capturing group

Lookahead

  • Extensive language support
  • Positive lookahead: HAVE the specified pattern ahead of your match
  • Negative lookahead: DOES NOT HAVE the specified pattern ahead of your match
  • Lookahead group must follow the intended match.
PatternDescription
(?=X)X, via zero-width positive lookahead
(?!X)X, via zero-width negative lookahead

Lookbehind

  • Not enough language support
  • Lookahead group must precede the intended match.
PatternDescription
(?<=X)X, via zero-width positive lookbehind
(?<!X)X, via zero-width negative lookbehind

Backreference

PatternDescription
\nWhatever the nth capturing group matched
\k<name>Whatever the named-capturing group "name" matched

Quantifier

  • Greedy: the greatest possible match, starting from the entire input string, backing off by one character each attempt.
  • Reluctant: the smallest possible match, starting from the beginning of the input string, expanding by one character each attempt.
  • Possessive: only try once for entire input string, useful to prevent the regex engine from trying all permutations, primarily for performance reasons.

Implementations

POSIX

BRE (Basic Regular Expression) (opens in a new tab)

ERE (Extended Regular Expression) (opens in a new tab)

PCRE (Perl Compatible Regular Expression)

JVM

Java

  • Calling matcher.group() must be preceded by a call to matcher.find(), matcher.matches(), or matcher.lookingAt(), no text would be found otherwise.

    A matcher is created using the pattern.matcher(String) method call, but we need to invoke one of these three methods to perform a match operation.

  • In some cases, line boundary matcher ^ and $ might not work on another platform. For example, on Windows, they won't match text with Linux line separators. In this case, use \R (JDK 8+) to match any line separator.

Groovy

Performance

  • Pitfalls

    • It is more common to accidentally create regexes that run in quadratic time.

    • Recompilation

    • Dot-star in the middle (which causes backtracking)

      • Solution 1: Use negated character class
      • Solution 2: Use reluctant quantifiers
    • Nested Repetition

  • Tips

    • Use non-capturing groups when you need parentheses but not capture.

    • If the regex is very complex, do a quick spot-check before attempting a match, e.g:

      • Does an email address contain '@'?
    • Present the most likely alternative(s) first, e.g:

      • black|white|blue|red|green|metallic seaweed
    • Reduce the amount of looping the engine has to do

      • \d\d\d\d\d is faster than \d{5}

      • aaaa+ is faster than a{4,}

    • Avoid obvious backtracking, e.g:

      • Mr|Ms|Mrs should be M(?:rs?|s)

      • Good morning|Good evening should be Good (?:morning|evening)