Trois Îles, Luxembourg

Regular expressions, or "regexes" for short, are patterns of characters and metacharacters that can be used for matching strings. For example, the pattern "gr[ae]y" matches both "gray" and "grey".

While regular expressions are an integral part of other popular languages, they were introduced to the Java world rather late, with the release of Java 1.4 in 2002. Perl, certainly the mother language of modern regexes, had already turned 15 that year.

Regexes are sometimes hard to understand, but once you get the hang of them, they will soon become your weapon of choice when you have to deal with text.

In this article, I focus on Java code patterns for common scenarios. If you have never heard of regular expressions before, the Wikipedia article and the Pattern JavaDoc are good starting points. The Regex Crossword site is a great place for working out your regex muscles.

Matching

The primary thing you can do with regular expressions is to check if a string matches a pattern.

boolean match = Pattern.matches(".*Cream.*", "Ice Cream Sandwich");
assertThat(match, is(true));

The Pattern.matches() method compiles the pattern every time before matching the string. When the pattern is used repeatedly, it's better to compile it once with Pattern.compile(), reuse the Pattern instance, and call its matcher() method to create a Matcher object for each input:

Pattern p = Pattern.compile(".*Cream.*");

boolean match = p.matcher("Ice Cream Sandwich").matches();
assertThat(match, is(true));

boolean match2 = p.matcher("Jelly Bean").matches();
assertThat(match2, is(false));
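
In practice, a pattern that is used repeatedly is often kept in a static final field. Pattern instances are immutable and safe to share between threads, while Matcher instances are not, so a fresh Matcher is created per call. A minimal sketch; the class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical helper class, only for illustration.
class DessertFilter {
    // Compiled once when the class is loaded, then reused for every call.
    private static final Pattern CREAM = Pattern.compile(".*Cream.*");

    static boolean containsCream(String name) {
        // A new Matcher per call; Matchers must not be shared between threads.
        return CREAM.matcher(name).matches();
    }
}
```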

Matcher.matches() returns true only if the entire string matches the regular expression. To find out whether the regular expression matches anywhere within the string, use Matcher.find() instead:

Pattern p = Pattern.compile("Cream");

boolean find = p.matcher("Ice Cream Sandwich").find();
assertThat(find, is(true));     // a part of the string has matched

boolean match = p.matcher("Ice Cream Sandwich").matches();
assertThat(match, is(false));   // but the regex does not match the entire string

A pattern can also be used as a find predicate (e.g. for filtering):

List<String> result = Stream.of("Pear", "Plum", "Honey", "Cherry Pie")
        .filter(Pattern.compile("P.*").asPredicate())
        .collect(Collectors.toList());
assertThat(result, contains("Pear", "Plum", "Cherry Pie"));

Note that "Cherry Pie" matches as well because this is a find predicate, so the pattern only needs to match a part of the string. Since Java 11, there are also match predicates, which must match the entire string:

List<String> result = Stream.of("Pear", "Plum", "Honey", "Cherry Pie")
        .filter(Pattern.compile("P.*").asMatchPredicate())
        .collect(Collectors.toList());
assertThat(result, contains("Pear", "Plum"));

Can you find a way to get the same result with a find predicate?
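
One possible answer to the question above is to anchor the pattern at the start of the input with "^". For single-line strings like these, that should give the same result as the match predicate. The class and method names below are made up for this sketch:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, only for illustration.
class AnchoredFind {
    static List<String> startingWithP(String... names) {
        // ^ pins the find to the start of the input, so the "P" in
        // "Cherry Pie" no longer counts.
        return Stream.of(names)
                .filter(Pattern.compile("^P").asPredicate())
                .collect(Collectors.toList());
    }
}
```

Since asPredicate() is available from Java 8 on, this variant also works where asMatchPredicate() is not available yet.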

Splitting

Texts can be split at a delimiter using regular expressions. The following example splits a CSV line, accepting both comma and semicolon as delimiter characters:

Pattern p = Pattern.compile("[;,]");
String[] result = p.split("123,abc;foo");
assertThat(result, arrayContaining("123", "abc", "foo"));

It is also possible to split straight into a Stream:

Pattern p = Pattern.compile("[;,]");
List<String> result = p.splitAsStream("123,abc;foo")
        .collect(Collectors.toList());
assertThat(result, contains("123", "abc", "foo"));
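
One detail that matters for CSV-like data is that split() drops trailing empty strings by default, so empty fields at the end of a line silently disappear. Passing a negative limit keeps them. The following sketch illustrates the difference; the class and method names are assumptions for this example:

```java
import java.util.regex.Pattern;

// Hypothetical helper class, only for illustration.
class CsvSplit {
    private static final Pattern DELIMITER = Pattern.compile("[;,]");

    // Default behaviour: trailing empty strings are dropped.
    static String[] splitDefault(String line) {
        return DELIMITER.split(line);
    }

    // A negative limit keeps every field, including trailing empty ones.
    static String[] splitKeepingEmpty(String line) {
        return DELIMITER.split(line, -1);
    }
}
```

For the input "a,,b,;", splitDefault() returns the three fields "a", "" and "b", while splitKeepingEmpty() returns all five fields. According to its JavaDoc, splitAsStream() discards trailing empty strings as well.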

Extracting

Regular expressions are extremely useful for locating and extracting certain parts of a string. For example, let's say we have an ISO date string and we would like to extract the year, month, and day. Parentheses are used for marking the desired groups in the pattern. The matching part of each group can then be read by its positional number:

Pattern p = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})T.*");
Matcher m = p.matcher("2014-08-27T21:33:11Z");
if (m.matches()) {
    String year = m.group(1);
    String month = m.group(2);
    String day = m.group(3);
    assertThat(year, is("2014"));
    assertThat(month, is("08"));
    assertThat(day, is("27"));
}

Groups are counted by their left parenthesis, starting from 1. Group number 0 always refers to the entire match. It's even better to use group names, so you won't need to care about their positions:

Pattern p = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})T.*");
Matcher m = p.matcher("2014-08-27T21:33:11Z");
if (m.matches()) {
    String day = m.group("day");
    String month = m.group("month");
    String year = m.group("year");
    assertThat(day, is("27"));
    assertThat(month, is("08"));
    assertThat(year, is("2014"));
}

Note that you must always invoke matches() (or find() or lookingAt()) before invoking group(), even when you are absolutely sure that the text matches. Otherwise group() throws an IllegalStateException.
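
The following sketch shows one way to wrap the extraction so that the match attempt can never be forgotten, next to a deliberately wrong variant that skips it. Both method names and the class name are made up for illustration:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, only for illustration.
class IsoDate {
    private static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})T.*");

    // Returns the year, or an empty Optional if the text does not match.
    static Optional<String> year(String text) {
        Matcher m = DATE.matcher(text);
        return m.matches() ? Optional.of(m.group("year")) : Optional.empty();
    }

    // Deliberately wrong: no match attempt before group().
    static String yearWithoutMatching(String text) {
        return DATE.matcher(text).group("year"); // throws IllegalStateException
    }
}
```

Returning an Optional is just one design choice here; it forces the caller to deal with text that does not match.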

Replacing

Let's replace some text! The next example replaces the word "apple" with the word "cherry":

Pattern p = Pattern.compile("apple");
Matcher m = p.matcher("sweet apple pie");
String result = m.replaceAll("cherry");
assertThat(result, is("sweet cherry pie"));

That was simple. However, this example would also convert a "sweet pineapple pie" to a "sweet pinecherry pie". Can you find a way to match only the word "apple"?
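
One possible answer is to surround the word with the word boundary matcher \b. This is a sketch with made-up class and method names:

```java
import java.util.regex.Pattern;

// Hypothetical helper class, only for illustration.
class FruitSwap {
    // \b matches at a word boundary, so the "apple" inside "pineapple"
    // is skipped.
    private static final Pattern APPLE = Pattern.compile("\\bapple\\b");

    static String applesToCherries(String text) {
        return APPLE.matcher(text).replaceAll("cherry");
    }
}
```

Note that the trailing \b also keeps the plural "apples" from being replaced, which may or may not be what you want.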

Let's make it more challenging and replace period decimal separators with commas, but leave punctuation marks unchanged. We will match decimal numbers and use the group references $1 and $2 in the replacement string:

Pattern p = Pattern.compile("(\\d+)\\.(\\d+)");
Matcher m = p.matcher("This is a book. It costs €35.71.");
String result = m.replaceAll("$1,$2");
assertThat(result, is("This is a book. It costs €35,71."));
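
Named groups can be referenced in the replacement string as well, using the ${name} syntax. The sketch below reuses the date pattern from the Extracting section (without the time part) to reorder the date components; the class and method names are assumptions for this example:

```java
import java.util.regex.Pattern;

// Hypothetical helper class, only for illustration.
class DateReformat {
    private static final Pattern ISO_DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    // Rewrites every yyyy-MM-dd date in the text as dd.MM.yyyy.
    static String toDayFirst(String text) {
        // The periods are literal here; only $ and \ are special
        // in a replacement string.
        return ISO_DATE.matcher(text).replaceAll("${day}.${month}.${year}");
    }
}
```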

What if we would like to compute the replacement string at runtime? In the next example, the name of a special ingredient in a famous Monty Python quote is converted to upper case. For the sake of this example, String.toUpperCase() is used instead of just replacing the lower case word with the upper case word.

Pattern p = Pattern.compile("spam");
Matcher m = p.matcher("spam, egg, spam, spam, bacon and spam");
String result = m.replaceAll(r -> r.group().toUpperCase());
assertThat(result, is("SPAM, egg, SPAM, SPAM, bacon and SPAM"));

The example above requires Java 9 or higher. If you need to use Java 8, you can simulate replaceAll() with this helper method:

public static String replaceAll(Matcher m, Function<MatchResult, String> replacer) {
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        m.appendReplacement(sb, replacer.apply(m));
    }
    m.appendTail(sb);
    return sb.toString();
}

Quoting

To be honest, writing regular expressions in Java can be a real pain sometimes. Other languages offer regex literals, like /\d+/. With Java, we're not that lucky. We only have plain string literals, so we need to escape each regex backslash with another backslash:

Pattern p = Pattern.compile("\\d+"); // regex: \d+

Even worse, if we want to match a backslash character, we have to actually write it four times (twice for the regular expression and twice again for the Java string):

Pattern p = Pattern.compile("C:\\\\"); // regex: C:\\ , matches C:\

Java 12 was supposed to bring raw string literals, which would have cleaned up the backslash mess a bit. Sadly, this feature was dropped before the final release.

Quoting is used when the search string contains regex metacharacters. For example, when we would like to match the ASCII representation of the copyright symbol "(c)", a regular expression of "(c)" would actually match any "c" character, because the parentheses only mark a group. We have to use backslashes to escape the meaning of the parentheses: "\(c\)" (and then double the backslashes in the Java string).

The Pattern.quote() method helps us quote fixed strings:

Pattern p = Pattern.compile(".*" + Pattern.quote("(c)") + ".*");
boolean copyrighted = p.matcher("Material is (c) 2014").matches();
assertThat(copyrighted, is(true));
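
Quoting matters most when the search text is not under your control, for example when it comes from user input. The following sketch compares a quoted and an unquoted search; the class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical helper class, only for illustration.
class LiteralSearch {
    // Finds the needle literally, whatever characters it contains.
    static boolean containsLiteral(String haystack, String needle) {
        return Pattern.compile(Pattern.quote(needle)).matcher(haystack).find();
    }

    // The same search without quoting, shown for comparison only.
    static boolean containsUnquoted(String haystack, String needle) {
        return Pattern.compile(needle).matcher(haystack).find();
    }
}
```

With the needle "1+1", the quoted search finds the text in "1+1=2". The unquoted search reads "1+" as "one or more 1", so it misses "1+1=2" but finds a match in "111".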

In one of the examples above, the group references "$1" and "$2" were used in the replaceAll() call. To use arbitrary strings as replacement, we must escape the special characters as well. This is what Matcher.quoteReplacement() does for us. In the next example, the replacement string is supposed to be $12, instead of a reference to the content of group 12:

Pattern p = Pattern.compile("PRICETAG");
Matcher m = p.matcher("This book is PRICETAG.");
String result = m.replaceAll(Matcher.quoteReplacement("$12"));
assertThat(result, is("This book is $12."));
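
To see why the quoting is needed, the sketch below puts the quoted call next to an unquoted one. Since the pattern has no groups, the unquoted "$12" refers to a group that does not exist, and replaceAll() is documented to throw an IndexOutOfBoundsException in that case. The class and method names are assumptions for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, only for illustration.
class PriceTag {
    private static final Pattern TAG = Pattern.compile("PRICETAG");

    // Safe: the price is inserted literally, even if it contains $ or \.
    static String fill(String text, String price) {
        return TAG.matcher(text).replaceAll(Matcher.quoteReplacement(price));
    }

    // Unsafe, for comparison only: "$12" is parsed as a group reference.
    static String fillUnquoted(String text, String price) {
        return TAG.matcher(text).replaceAll(price);
    }
}
```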