Use Cases
- Find whole word: `\bword\b`
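As a minimal sketch of the whole-word use case in Java, the helper below counts whole-word occurrences. The class name `WholeWordExample` and the method `countWholeWord` are illustrative names, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordExample {
    // Counts occurrences of `word` that are delimited by word boundaries (\b).
    static int countWholeWord(String text, String word) {
        // Pattern.quote keeps regex metacharacters in `word` from being interpreted.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "concat" and "category" contain "cat" but not as a whole word.
        System.out.println(countWholeWord("cat concat cat, category", "cat")); // prints 2
    }
}
```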
Character classes
`.` matches any character except a line terminator, unless the DOTALL flag is enabled.
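The effect of DOTALL on `.` can be sketched in Java as follows. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class DotAllExample {
    // Default mode: '.' does not match a line terminator.
    static boolean matchesDefault(String s) {
        return Pattern.compile("a.b").matcher(s).matches();
    }

    // DOTALL mode: '.' matches any character, including line terminators.
    static boolean matchesDotAll(String s) {
        return Pattern.compile("a.b", Pattern.DOTALL).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesDefault("a\nb")); // prints false
        System.out.println(matchesDotAll("a\nb"));  // prints true
    }
}
```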
Modes
Flags
- Java
  - Pattern.DOTALL: `(?s)`
  - Pattern.CASE_INSENSITIVE: `(?i)`
  - Pattern.MULTILINE: `(?m)`
Usage
- Java
  - Multiple modes can be combined with bitwise OR: `Pattern.DOTALL | Pattern.MULTILINE`
  - Modes can also be enabled via an embedded flag expression at the start of the pattern string, alone or together with flag constants: `Pattern.compile("(?s)aa(.*)bb", Pattern.CASE_INSENSITIVE);`
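The two ways of enabling modes can be compared with a small Java sketch. It assumes the same `aa(.*)bb` pattern used above; the class, field, and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsExample {
    // Flags passed as constants, combined with bitwise OR.
    static final Pattern WITH_CONSTANTS =
            Pattern.compile("aa(.*)bb", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // The same flags as an embedded flag expression: s = DOTALL, i = CASE_INSENSITIVE.
    static final Pattern WITH_EMBEDDED = Pattern.compile("(?si)aa(.*)bb");

    // No flags at all, for contrast.
    static final Pattern PLAIN = Pattern.compile("aa(.*)bb");

    // Returns the text captured between "aa" and "bb", or null if the input does not match.
    static String inner(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "AA x\ny BB";
        System.out.println(inner(WITH_CONSTANTS, input) != null); // prints true
        System.out.println(inner(WITH_EMBEDDED, input) != null);  // prints true
        System.out.println(inner(PLAIN, input) != null);          // prints false
    }
}
```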
Boundary matchers
Capturing group and non-capturing group
- Group 0 usually denotes the whole match rather than a parenthesized group; numbered groups start at 1.
Pattern | Description |
---|---|
(?<name>X) | X, as a named-capturing group |
(?:X) | X, as a non-capturing group |
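A short Java sketch of the two group kinds in the table: the year and day are named capturing groups, while the month is only grouped. The date format and all names here are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupExample {
    // (?<year>...) and (?<day>...) capture; (?:...) groups the month alternatives without capturing.
    private static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?:0[1-9]|1[0-2])-(?<day>\\d{2})");

    static String describe(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        // group(0) is the whole match; groupCount() excludes group 0 and non-capturing groups.
        return m.group(0) + " year=" + m.group("year") + " day=" + m.group("day")
                + " groups=" + m.groupCount();
    }

    public static void main(String[] args) {
        System.out.println(describe("2024-03-15")); // prints 2024-03-15 year=2024 day=15 groups=2
        System.out.println(describe("2024-13-15")); // prints no match
    }
}
```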
Lookahead
- Widely supported across regex engines.
- Positive lookahead: the specified pattern must appear immediately ahead of the current position.
- Negative lookahead: the specified pattern must not appear immediately ahead of the current position.
- The lookahead group typically follows the intended match.
Pattern | Description |
---|---|
(?=X) | X, via zero-width positive lookahead |
(?!X) | X, via zero-width negative lookahead |
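The following Java sketch shows both lookahead forms on a made-up input of CSS-like tokens. The `px` suffix, the class name, and the method names are illustrative assumptions. Because lookahead is zero-width, the suffix is checked but never included in the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadExample {
    private static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Positive lookahead: digits that are immediately followed by "px".
    static List<String> followedByPx(String input) {
        return findAll("\\d+(?=px)", input);
    }

    // Negative lookahead: digits not followed by "px". The extra \d alternative
    // stops the engine from backtracking to a shorter run such as "3" in "30px".
    static List<String> notFollowedByPx(String input) {
        return findAll("\\d+(?!\\d|px)", input);
    }

    public static void main(String[] args) {
        System.out.println(followedByPx("12em 30px 7"));    // prints [30]
        System.out.println(notFollowedByPx("12em 30px 7")); // prints [12, 7]
    }
}
```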
Lookbehind
- Less widely supported than lookahead; some engines lack it or restrict it to fixed-length or bounded patterns.
- The lookbehind group typically precedes the intended match.
Pattern | Description |
---|---|
(?<=X) | X, via zero-width positive lookbehind |
(?<!X) | X, via zero-width negative lookbehind |
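A Java sketch of both lookbehind forms, using a made-up input where some numbers carry a `$` prefix. All names are illustrative. The prefix is tested but not consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindExample {
    private static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Positive lookbehind: digits immediately preceded by a dollar sign.
    static List<String> dollarAmounts(String input) {
        return findAll("(?<=\\$)\\d+", input);
    }

    // Negative lookbehind: digits preceded by neither '$' nor another digit.
    // Excluding a preceding digit prevents a match starting inside "$15".
    static List<String> plainNumbers(String input) {
        return findAll("(?<![$\\d])\\d+", input);
    }

    public static void main(String[] args) {
        System.out.println(dollarAmounts("$15 and 20")); // prints [15]
        System.out.println(plainNumbers("$15 and 20"));  // prints [20]
    }
}
```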
Backreference
Pattern | Description |
---|---|
\n | Whatever the nth capturing group matched |
\k<name> | Whatever the named-capturing group "name" matched |
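Both backreference forms can be sketched in Java: a numbered backreference that detects a doubled word, and a named one that requires the closing quote to equal the opening quote. The inputs and names are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceExample {
    // \1 must repeat exactly what group 1 captured.
    private static final Pattern DOUBLED_WORD = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // \k<quote> must repeat the quote character captured by the named group.
    private static final Pattern QUOTED =
            Pattern.compile("(?<quote>[\"'])(?<body>.*?)\\k<quote>");

    // Returns the first word that appears twice in a row, or null.
    static String firstDoubledWord(String input) {
        Matcher m = DOUBLED_WORD.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Returns the body of the first quoted string, or null.
    static String firstQuotedBody(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.find() ? m.group("body") : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("this is is a test")); // prints is
        System.out.println(firstQuotedBody("say \"it's\" now"));   // prints it's
    }
}
```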
Quantifier
- Greedy (`X*`): tries the longest possible match first, consuming as much input as it can and backing off one character at a time until the rest of the pattern matches.
- Reluctant (`X*?`): tries the shortest possible match first, consuming as little input as it can and expanding one character at a time until the rest of the pattern matches.
- Possessive (`X*+`): consumes as much input as it can and never backs off, even if that makes the overall match fail. It keeps the regex engine from trying all permutations, so it is used primarily for performance reasons.
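The three quantifier kinds can be compared on one input with the Java sketch below. The input string and the helper `firstMatch` are illustrative choices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierExample {
    // Returns the first match of `regex` in `input`, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off until "foo" fits at the end.
        System.out.println(firstMatch(".*foo", input));  // prints xfooxxxxxxfoo
        // Reluctant: .*? grows one character at a time and stops at the first "foo".
        System.out.println(firstMatch(".*?foo", input)); // prints xfoo
        // Possessive: .*+ takes everything and never gives it back, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input)); // prints null
    }
}
```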
Implementations
POSIX
BRE (Basic Regular Expression)
ERE (Extended Regular Expression)
PCRE (Perl Compatible Regular Expression)
JVM
Java
- Calling `matcher.group()` must be preceded by a call to `matcher.find()`, `matcher.matches()`, or `matcher.lookingAt()`; otherwise no match is available and `group()` throws an `IllegalStateException`. A matcher is created with `pattern.matcher(String)`, but one of these three methods must be invoked to actually perform a match operation.
- In some cases, the line boundary matchers `^` and `$` might not behave as expected on text that uses another platform's line separators. In this case, use `\R` (JDK 8+) to match any line separator.
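Both Java notes can be sketched as follows: `group()` fails until a match operation has run, and `\R` splits on any line separator. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherUsageExample {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Returns true if group() throws because no match operation has been performed yet.
    static boolean groupBeforeMatchThrows() {
        Matcher m = NUMBER.matcher("abc 123");
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Correct order: perform the match with find(), then read the group.
    static String firstNumber(String input) {
        Matcher m = NUMBER.matcher(input);
        return m.find() ? m.group() : null;
    }

    // \R matches \r\n as a single unit as well as \n and \r alone.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        System.out.println(groupBeforeMatchThrows());           // prints true
        System.out.println(firstNumber("abc 123"));             // prints 123
        System.out.println(splitLines("a\r\nb\nc\rd").length);  // prints 4
    }
}
```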
Groovy
- The regex find operator, `=~`: `text =~ pattern`
  - `boolean java.util.regex.Matcher.find()`
- The regex match operator, `==~`: `text ==~ pattern`
  - `boolean java.util.regex.Matcher.matches()`
- The regex pattern operator, `~string`: `~/\d\w/`
  - `java.util.regex.Pattern`
- Reference: Groovy Regular Expression Operators
- Code Example: RegexSpec
Performance
Pitfalls
- It is fairly common to accidentally create regexes that run in quadratic time.
- Recompilation: compiling the same pattern repeatedly instead of reusing the compiled pattern.
- Dot-star in the middle (which causes backtracking)
  - Solution 1: Use a negated character class
  - Solution 2: Use reluctant quantifiers
- Nested repetition
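The recompilation and dot-star pitfalls can be sketched in Java. The patterns below are compiled once as constants and reused, and the two suggested solutions are shown next to a greedy dot-star for contrast. The quoted-string task and all names are assumptions for illustration; the sketch demonstrates matching behavior only and makes no timing claims.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PitfallExample {
    // Compiled once and reused, instead of calling Pattern.compile on every invocation.
    private static final Pattern GREEDY_DOT_STAR = Pattern.compile("\"(.*)\"");
    private static final Pattern NEGATED_CLASS = Pattern.compile("\"([^\"]*)\"");
    private static final Pattern RELUCTANT = Pattern.compile("\"(.*?)\"");

    private static List<String> extract(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Dot-star in the middle: runs to the end of the input, then backtracks to the last quote.
    static List<String> withGreedyDotStar(String input) {
        return extract(GREEDY_DOT_STAR, input);
    }

    // Solution 1: a negated character class cannot run past the closing quote.
    static List<String> withNegatedClass(String input) {
        return extract(NEGATED_CLASS, input);
    }

    // Solution 2: a reluctant quantifier stops at the first closing quote.
    static List<String> withReluctant(String input) {
        return extract(RELUCTANT, input);
    }

    public static void main(String[] args) {
        String input = "a \"b\" c \"d e\"";
        System.out.println(withGreedyDotStar(input)); // prints [b" c "d e]
        System.out.println(withNegatedClass(input));  // prints [b, d e]
        System.out.println(withReluctant(input));     // prints [b, d e]
    }
}
```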
Tips
- Use non-capturing groups when you need parentheses but not capture.
- If the regex is very complex, do a quick spot-check before attempting a match, e.g.:
  - Does an email address contain '`@`'?
- Present the most likely alternative(s) first, e.g.: `black|white|blue|red|green|metallic seaweed`
- Reduce the amount of looping the engine has to do:
  - `\d\d\d\d\d` is faster than `\d{5}`
  - `aaaa+` is faster than `a{4,}`
- Avoid obvious backtracking, e.g.:
  - `Mr|Ms|Mrs` should be `M(?:rs?|s)`
  - `Good morning|Good evening` should be `Good (?:morning|evening)`
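The factored alternation from the last tip can be checked against the original with a small Java sketch. The names are illustrative, and the sketch only compares matching behavior; it does not measure speed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationExample {
    static final Pattern NAIVE = Pattern.compile("Mr|Ms|Mrs");
    // The shared prefix "M" is factored out; the non-capturing group holds the rest.
    static final Pattern FACTORED = Pattern.compile("M(?:rs?|s)");

    // True if both patterns agree on whether the whole input is a title.
    static boolean sameFullMatchVerdict(String input) {
        return NAIVE.matcher(input).matches() == FACTORED.matcher(input).matches();
    }

    // Returns the first substring found by the pattern, or null.
    static String found(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Mr", "Ms", "Mrs", "Mx"}) {
            System.out.println(s + " " + sameFullMatchVerdict(s)); // prints true for each input
        }
        System.out.println(found(NAIVE, "Mrs Smith"));    // prints Mr
        System.out.println(found(FACTORED, "Mrs Smith")); // prints Mrs
    }
}
```

With `matches()` the two patterns accept the same inputs. With `find()` they can differ: the unfactored form stops at the first alternative that succeeds, so it finds `Mr` inside `Mrs`, which is one more reason to prefer the factored form.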