Use Cases
- Find whole word: \bword\b
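As a minimal sketch of the whole-word pattern in Java, the snippet below wraps a literal word in `\b` boundaries. The class and method names (`WholeWordExample`, `containsWholeWord`) are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

public class WholeWordExample {
    // Hypothetical helper: true if `word` occurs in `text` as a whole word.
    static boolean containsWholeWord(String text, String word) {
        // Pattern.quote keeps any regex metacharacters in `word` literal.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWholeWord("a word here", "word")); // true
        System.out.println(containsWholeWord("swordfish", "word"));   // false
    }
}
```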
Character classes
The dot (.) matches any character except a line terminator unless the DOTALL flag is enabled.
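The effect of DOTALL on the dot can be sketched as follows. The helper name `dotMatches` is an assumption made for this example.

```java
import java.util.regex.Pattern;

public class DotAllExample {
    // Hypothetical helper: does "a.b" match the whole input under the given flags?
    static boolean dotMatches(String input, int flags) {
        return Pattern.compile("a.b", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Without DOTALL the dot does not match the newline.
        System.out.println(dotMatches("a\nb", 0));              // false
        // With DOTALL the dot also matches line terminators.
        System.out.println(dotMatches("a\nb", Pattern.DOTALL)); // true
    }
}
```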
Modes
Flags
- Java
  - Pattern.DOTALL: (?s)
  - Pattern.CASE_INSENSITIVE: (?i)
  - Pattern.MULTILINE: (?m)
Usage
- Java
  - Multiple modes can be combined with the bitwise OR operator: Pattern.DOTALL | Pattern.MULTILINE
  - Modes can also be enabled via an embedded flag expression at the start of the pattern string, optionally together with flag arguments: Pattern.compile("(?s)aa(.*)bb", Pattern.CASE_INSENSITIVE);
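A short sketch of both ways of enabling modes in Java. The class name `FlagsExample`, the helper `firstGroup`, and the sample inputs are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagsExample {
    // Flags combined with bitwise OR.
    static final Pattern COMBINED =
            Pattern.compile("^b(.*)", Pattern.DOTALL | Pattern.MULTILINE);

    // Embedded (?s) flag plus a flag argument, as in the snippet above.
    static final Pattern EMBEDDED =
            Pattern.compile("(?s)aa(.*)bb", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: text of group 1 for the first match, or null.
    static String firstGroup(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // MULTILINE lets ^ match after the first newline; DOTALL lets .* cross the second one.
        System.out.println("\nc".equals(firstGroup(COMBINED, "a\nb\nc")));      // true
        // CASE_INSENSITIVE matches AA/BB; (?s) lets .* span the newlines in between.
        System.out.println("\nx\n".equals(firstGroup(EMBEDDED, "AA\nx\nBB"))); // true
    }
}
```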
Boundary matchers
Capturing group and non-capturing group
- Group 0 denotes the entire match; capturing groups are numbered from 1, in the order of their opening parentheses.
| Pattern | Description |
|---|---|
| (?<name>X) | X, as a named-capturing group |
| (?:X) | X, as a non-capturing group |
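To illustrate how named, numbered, and non-capturing groups interact in Java, here is a small sketch. The date pattern and the names `GroupExample` and `parse` are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupExample {
    // Named group "year", a non-capturing group for the month, a numbered group for the day.
    private static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?:\\d{2})-(\\d{2})");

    // Returns {whole match, year, day}, or null if the input does not match.
    static String[] parse(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // Named groups are also numbered, so "year" is group 1 and the day is group 2.
        return new String[] { m.group(0), m.group("year"), m.group(2) };
    }

    // The non-capturing group is not counted.
    static int groupCount() {
        return DATE.matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", parse("2024-03-15"))); // 2024-03-15, 2024, 15
        System.out.println(groupCount());                           // 2
    }
}
```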
Lookahead
- Widely supported across regex implementations
- Positive lookahead: HAVE the specified pattern ahead of your match
- Negative lookahead: DOES NOT HAVE the specified pattern ahead of your match
- Lookahead group must follow the intended match.
| Pattern | Description |
|---|---|
| (?=X) | X, via zero-width positive lookahead |
| (?!X) | X, via zero-width negative lookahead |
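A minimal Java sketch of both lookahead forms. The currency strings and the names `LookaheadExample` and `firstMatch` are assumptions for illustration. Note the `\b` anchors in the negative case: without them the engine could backtrack to a shorter number that is no longer followed by " USD".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadExample {
    // Hypothetical helper: first match of `regex` in `input`, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "100 USD, 200 EUR";
        // Positive lookahead: a number that IS followed by " USD".
        System.out.println(firstMatch("\\d+(?= USD)", input));        // 100
        // Negative lookahead: a whole number that is NOT followed by " USD".
        System.out.println(firstMatch("\\b\\d+\\b(?! USD)", input));  // 200
    }
}
```

The lookahead is zero-width, so " USD" is checked but not included in the returned match.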
Lookbehind
- Less widely supported than lookahead; some regex implementations lack it or restrict the patterns allowed inside it
- Lookbehind group must precede the intended match.
| Pattern | Description |
|---|---|
| (?<=X) | X, via zero-width positive lookbehind |
| (?<!X) | X, via zero-width negative lookbehind |
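The lookbehind forms can be sketched in Java in the same way. The input string and the names `LookbehindExample` and `firstMatch` are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindExample {
    // Hypothetical helper: first match of `regex` in `input`, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "cost: $42, qty 7";
        // Positive lookbehind: digits that ARE preceded by a dollar sign.
        System.out.println(firstMatch("(?<=\\$)\\d+", input));      // 42
        // Negative lookbehind: a number that is NOT preceded by a dollar sign.
        // The \b stops the match from starting in the middle of "42".
        System.out.println(firstMatch("(?<!\\$)\\b\\d+", input));   // 7
    }
}
```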
Backreference
| Pattern | Description |
|---|---|
| \n | Whatever the nth capturing group matched |
| \k<name> | Whatever the named-capturing group "name" matched |
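One common use of backreferences is spotting a repeated word. The sketch below shows the numbered and the named form in Java; the names `BackreferenceExample`, `repeatedWord`, and the group name `word` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackreferenceExample {
    // \1 refers back to whatever group 1 matched.
    private static final Pattern NUMBERED = Pattern.compile("\\b(\\w+) \\1\\b");
    // \k<word> refers back to the named group "word".
    private static final Pattern NAMED = Pattern.compile("\\b(?<word>\\w+) \\k<word>\\b");

    // Returns the first word that appears twice in a row, or null.
    static String repeatedWord(String input, boolean useNamed) {
        Matcher m = (useNamed ? NAMED : NUMBERED).matcher(input);
        if (!m.find()) {
            return null;
        }
        return useNamed ? m.group("word") : m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(repeatedWord("this is is a test", false)); // is
        System.out.println(repeatedWord("this is is a test", true));  // is
    }
}
```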
Quantifier
- Greedy: the greatest possible match, starting from the entire input string, backing off by one character each attempt.
- Reluctant: the smallest possible match, starting from the beginning of the input string, expanding by one character each attempt.
- Possessive: consumes as much input as possible, like greedy, but never backs off, even if that makes the overall match fail; useful to prevent the regex engine from trying all permutations, primarily for performance reasons.
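The three quantifier types can be compared on a single input. In Java they are written X*, X*? and X*+ respectively. The class and helper names below are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierExample {
    // Hypothetical helper: first match of `regex` in `input`, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* grabs everything, then backs off until "foo" fits.
        System.out.println(firstMatch(".*foo", input));  // xfooxxxxxxfoo
        // Reluctant: .*? grows one character at a time until "foo" fits.
        System.out.println(firstMatch(".*?foo", input)); // xfoo
        // Possessive: .*+ grabs everything and never backs off, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input)); // null
    }
}
```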
Implementations
POSIX
BRE (Basic Regular Expression)
ERE (Extended Regular Expression)
PCRE (Perl Compatible Regular Expression)
JVM
Java
- Calling matcher.group() must be preceded by a successful call to matcher.find(), matcher.matches(), or matcher.lookingAt(); otherwise it throws an IllegalStateException. A matcher is created with pattern.matcher(CharSequence), but creating it does not perform a match: one of these three methods must be invoked first.
- In some cases, the line boundary matchers ^ and $ might not behave as expected on text that comes from another platform, for example when a pattern written with Windows line separators in mind meets text with Linux line separators. In this case, use \R (JDK 8+) to match any line separator.
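Both notes can be sketched in a few lines of Java. The names `MatcherStateExample`, `groupBeforeMatchThrows`, `groupAfterFind`, and `splitLines` are hypothetical helpers introduced only for this example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherStateExample {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Calling group() before any match operation throws IllegalStateException.
    static boolean groupBeforeMatchThrows() {
        Matcher m = DIGITS.matcher("abc 123");
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // After a successful find(), group() returns the matched text.
    static String groupAfterFind() {
        Matcher m = DIGITS.matcher("abc 123");
        return m.find() ? m.group() : null;
    }

    // \R (JDK 8+) matches \r\n, \n, \r and other line separators.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        System.out.println(groupBeforeMatchThrows());                   // true
        System.out.println(groupAfterFind());                           // 123
        System.out.println(Arrays.toString(splitLines("a\r\nb\nc")));   // [a, b, c]
    }
}
```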
Groovy
- The regex find operator, =~: text =~ pattern, evaluated as a boolean via java.util.regex.Matcher.find()
- The regex match operator, ==~: text ==~ pattern, a boolean from java.util.regex.Matcher.matches()
- The regex pattern operator, ~string: ~"\d\w" creates a java.util.regex.Pattern
- Reference: Groovy Regular Expression Operators
- Code Example: RegexSpec
Performance
- Pitfalls
  - It is common to accidentally create regexes that run in quadratic time.
  - Recompilation
  - Dot-star in the middle (which causes backtracking)
    - Solution 1: Use a negated character class
    - Solution 2: Use reluctant quantifiers
  - Nested repetition
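Two of the pitfalls above can be sketched in Java: patterns are compiled once into constants to avoid recompilation, and a dot-star between delimiters is replaced by a negated character class or a reluctant quantifier. The tag-like input and all names (`PitfallExample`, `firstMatch`, the constants) are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PitfallExample {
    // Compile once and reuse: String.matches(...) and Pattern.matches(...)
    // compile the pattern again on every call.
    static final Pattern GREEDY = Pattern.compile("<.*>");
    static final Pattern NEGATED = Pattern.compile("<[^>]*>");
    static final Pattern RELUCTANT = Pattern.compile("<.*?>");

    // Hypothetical helper: first match of `p` in `input`, or null.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Dot-star runs to the end of the input, then backtracks to the last '>'.
        System.out.println(firstMatch(GREEDY, input));    // <a><b>
        // Solution 1: the negated class cannot run past the first '>'.
        System.out.println(firstMatch(NEGATED, input));   // <a>
        // Solution 2: the reluctant quantifier stops at the first '>'.
        System.out.println(firstMatch(RELUCTANT, input)); // <a>
    }
}
```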
- Tips
  - Use non-capturing groups when you need parentheses but not the capture.
  - If the regex is very complex, do a quick spot-check before attempting a match, e.g.:
    - Does an email address contain '@'?
  - Present the most likely alternative(s) first, e.g.: black|white|blue|red|green|metallic seaweed
  - Reduce the amount of looping the engine has to do, e.g.:
    - \d\d\d\d\d can be faster than \d{5}
    - aaaa+ can be faster than a{4,}
  - Avoid obvious backtracking, e.g.:
    - Mr|Ms|Mrs should be M(?:rs?|s)
    - Good morning|Good evening should be Good (?:morning|evening)
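The spot-check and factored-alternation tips might look like this in Java. The email pattern is deliberately simplified and is not a real validator; the names `TipsExample`, `looksLikeEmail`, and `isTitle` are hypothetical.

```java
import java.util.regex.Pattern;

public class TipsExample {
    // Simplified, illustrative email pattern (an assumption, not a full validator).
    private static final Pattern EMAIL =
            Pattern.compile("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");

    // Factored form of Mr|Ms|Mrs using a non-capturing group.
    private static final Pattern TITLE = Pattern.compile("M(?:rs?|s)");

    static boolean looksLikeEmail(String s) {
        // Cheap spot-check first: skip the regex entirely if there is no '@'.
        if (s.indexOf('@') < 0) {
            return false;
        }
        return EMAIL.matcher(s).matches();
    }

    static boolean isTitle(String s) {
        return TITLE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("user@example.com"));       // true
        System.out.println(looksLikeEmail("no-at-sign.example.com")); // false
        System.out.println(isTitle("Mrs"));                           // true
        System.out.println(isTitle("Mx"));                            // false
    }
}
```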