Regex in Java: Matching Line Breaks With a Dot
Last week I set out to write some unit tests for a Java application. Because the assertions target variable fields in a JSON string, I decided to use regular expressions (regex) to match these fields. In doing this, I learned a few interesting things about how regex is implemented in Java that weren’t immediately obvious to me.
Based on my experience with regex in Python, I started with the following code snippet. In this example, I want to assert that the key : value pair is present anywhere in exampleJsonString (which contains line breaks).
// Assign exampleJsonString and exampleRegex for clarity
String exampleJsonString = "{\n key : value, \n(...)\n}";
String exampleRegex = ".*key : value.*";
assertTrue(exampleJsonString.matches(exampleRegex));
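This did not work because in Java’s implementation of regex the dot wildcard . does not match line terminators by default. Since String.matches() only succeeds when the regex matches the entire string, the leading and trailing .* would have to span the line breaks, which they cannot. This behaviour wasn’t immediately obvious to me, and it took a while to figure out that some flavours of regex work this way. To cite the Java 14 docs on regex: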
Predefined character classes
. Any character (may or may not match line terminators)
With line-terminators being defined as:
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
- A newline (line feed) character (‘\n’),
- A carriage-return character followed immediately by a newline character (“\r\n”),
- A standalone carriage-return character (‘\r’),
- A next-line character (‘\u0085’),
- A line-separator character (‘\u2028’), or
- A paragraph-separator character (‘\u2029’).
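As a small illustration of the definition quoted above, the following sketch checks the default behaviour of the dot against each single-character line terminator from the list. The class and method names are made up for this example and do not come from the original test code.

```java
import java.util.regex.Pattern;

public class DotLineTerminatorDemo {

    // Hypothetical helper: true if "a", any one character matched by the dot, and "b"
    // make up the whole input. No flags are set, so this is the default dot behaviour.
    static boolean dotMatchesMiddle(String input) {
        return Pattern.compile("a.b").matcher(input).matches();
    }

    public static void main(String[] args) {
        // An ordinary character between a and b is matched by the dot.
        System.out.println(dotMatchesMiddle("a-b"));      // true

        // Each of these puts one line terminator between a and b.
        String[] withTerminators = {"a\nb", "a\rb", "a\u0085b", "a\u2028b", "a\u2029b"};
        for (String input : withTerminators) {
            System.out.println(dotMatchesMiddle(input));  // false for every entry
        }
    }
}
```

The two-character sequence \r\n is left out on purpose, because a single dot can only ever stand for one character.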
There are two ways to get around this limitation. The first way I found uses Pattern.compile() to compile the regex into an instance of the Pattern class. This allows you to pass the Pattern.DOTALL flag, which makes the . match any character, including line terminators, as follows:
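// Requires: import java.util.regex.Pattern;
String exampleJsonString = "{\n key : value, \n(...)\n}";
// DOTALL makes the dot match line terminators as well
Pattern exampleRegex = Pattern.compile(".*key : value.*", Pattern.DOTALL);
// matches() requires the whole string to match, just like String.matches()
assertTrue(exampleRegex.matcher(exampleJsonString).matches());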
Alternatively, the Pattern.DOTALL mode can be enabled via the embedded flag expression (?s). According to the docs, ‘[t]he s is a mnemonic for “single-line” mode, which is what this is called in Perl’. This results in:
String exampleJsonString = "{\n key : value, \n(...)\n}";
String exampleRegex = "(?s).*key : value.*";
assertTrue(exampleJsonString.matches(exampleRegex));
While both solutions exhibit the same functional behaviour, each has its benefits. The first solution performs better: executing the assertion 1 million times in a test method takes 251 ms for the first against 509 ms for the second. A likely reason is that String.matches() compiles the regex again on every call, whereas a compiled Pattern can be reused. The second solution is closer to what I started with and feels more straightforward to me. Embedded flags can also be applied to specific parts of the regex, such as within a capture group, at the expense of making the regex potentially harder to read; this holds for both compiled and uncompiled regex.
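To make the point about applying a flag to only part of a regex concrete, here is a minimal sketch. It uses the documented (?s:X) form, a non-capturing group with the flag switched on only inside it, rather than a capture group. The class name, the pattern, and the sample inputs are invented for illustration.

```java
import java.util.regex.Pattern;

public class ScopedDotAllDemo {

    // DOTALL is active only inside (?s:.), so the first dot may match a line break.
    // The second dot sits outside the group and keeps the default behaviour.
    static final Pattern SCOPED = Pattern.compile("key(?s:.)value.end");

    // Hypothetical helper: true if the whole input matches the scoped pattern.
    static boolean matchesScoped(String input) {
        return SCOPED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesScoped("key\nvalue-end"));   // true: line break inside the scoped group
        System.out.println(matchesScoped("key-value\nend"));   // false: line break hits the unscoped dot
    }
}
```

Keeping the compiled Pattern in a static final field also means it is compiled only once, no matter how often the helper is called.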
The Double-Escape
I’ll end with a short note on string literals. The Java compiler processes escape sequences in string literals before the regex engine ever sees them, which means that escaping characters in a regex requires some close attention. Say I want to match a digit using \d. Written with a single \, as in String regex = "\d";, the Java compiler tries to interpret \d as a string escape sequence instead of passing it on as part of the regex, and because \d is not a valid escape sequence the code does not compile. So instead we need to construct a literal \ preceding the d, as such: String regex = "\\d";. While this is a rather simple example, you can end up with quite a few backslashes and a complaining IDE.
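The two layers of escaping can be seen in a short sketch. The class and constant names are hypothetical. The second constant shows how the backslashes pile up when the regex itself has to match a literal backslash.

```java
public class DoubleEscapeDemo {

    // In source this is written with two backslashes, but the resulting string
    // holds only two characters: one backslash followed by d.
    static final String DIGIT_REGEX = "\\d";

    // Four backslashes in source become two in the string, which the regex
    // engine then reads as one literal backslash.
    static final String BACKSLASH_REGEX = "\\\\";

    public static void main(String[] args) {
        System.out.println(DIGIT_REGEX.length());            // 2
        System.out.println("7".matches(DIGIT_REGEX));        // true
        System.out.println("d".matches(DIGIT_REGEX));        // false
        System.out.println("\\".matches(BACKSLASH_REGEX));   // true: the input is a single backslash
    }
}
```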