ES proposal: RegExp named capture groups


The proposal “RegExp Named Capture Groups” by Daniel Ehrenberg is currently at stage 3. This blog post explains what it has to offer.

Before we get to named capture groups, let’s take a look at numbered capture groups, to introduce the idea of capture groups.

Numbered capture groups  

Numbered capture groups enable you to take apart a string with a regular expression.

Successfully matching a regular expression against a string returns a match object matchObj. Putting a fragment of the regular expression in parentheses turns that fragment into a capture group: the part of the string that it matches is stored in matchObj.

Prior to this proposal, all capture groups were accessed by number: the capture group starting with the first parenthesis via matchObj[1], the capture group starting with the second parenthesis via matchObj[2], etc.

For example, the following code shows how numbered capture groups are used to extract year, month and day from a date in ISO format:

const RE_DATE = /([0-9]{4})-([0-9]{2})-([0-9]{2})/;

const matchObj = RE_DATE.exec('1999-12-31');
const year = matchObj[1]; // 1999
const month = matchObj[2]; // 12
const day = matchObj[3]; // 31

Referring to capture groups via numbers has several disadvantages:

  1. Finding the number of a capture group is a hassle: you have to count parentheses.
  2. You need to see the regular expression if you want to understand what the groups are for.
  3. If you change the order of the capture groups, you also have to change the matching code.

All issues can be somewhat mitigated by defining constants for the numbers of the capture groups. However, named capture groups are an all-around superior solution.
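To illustrate the mitigation via constants, here is a minimal sketch. The constant names YEAR, MONTH and DAY are made up for this example; they are not part of any API. Note that you still have to update the constants by hand if the order of the groups changes.

```javascript
const RE_DATE = /([0-9]{4})-([0-9]{2})-([0-9]{2})/;

// Hypothetical constants that give the group numbers readable names
const YEAR = 1;
const MONTH = 2;
const DAY = 3;

const matchObj = RE_DATE.exec('1999-12-31');
const year = matchObj[YEAR]; // '1999'
const month = matchObj[MONTH]; // '12'
const day = matchObj[DAY]; // '31'
console.log(year, month, day);
```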

Named capture groups  

The proposed feature is about identifying capture groups via names:

(?<year>[0-9]{4})

Here we have tagged the previous capture group #1 with the name year. The name must be a legal JavaScript identifier (think variable name or property name). After matching, you can access the captured string via matchObj.groups.year.

The captured strings are not properties of matchObj, because you don’t want them to clash with current or future properties created by the regular expression API.

Let’s rewrite the previous code so that it uses named capture groups:

const RE_DATE = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;

const matchObj = RE_DATE.exec('1999-12-31');
const year = matchObj.groups.year; // 1999
const month = matchObj.groups.month; // 12
const day = matchObj.groups.day; // 31

Named capture groups also create indexed entries, as if they were numbered capture groups:

const year2 = matchObj[1]; // 1999
const month2 = matchObj[2]; // 12
const day2 = matchObj[3]; // 31

Destructuring can help with getting data out of the match object:

const {groups: {day, year}} = RE_DATE.exec('1999-12-31');
console.log(year); // 1999
console.log(day); // 31

Named capture groups have the following benefits:

  • It’s easier to find the “ID” of a capture group.
  • The matching code becomes self-descriptive, as the ID of a capture group describes what is being captured.
  • You don’t have to change the matching code if you change the order of the capture groups.
  • The names of the capture groups also make the regular expression easier to understand, as you can see directly what each group is for.

You can freely mix numbered and named capture groups.
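As a small sketch of such a mix (the name RE_MIXED is only chosen for this example): the first group below is named, the second one is not. Both get a number, but only the named group shows up in groups.

```javascript
const RE_MIXED = /(?<year>[0-9]{4})-([0-9]{2})/;
const mixedMatch = RE_MIXED.exec('1999-12');

// The named group is available by name and by number
console.log(mixedMatch.groups.year); // '1999'
console.log(mixedMatch[1]); // '1999'

// The unnamed group is only available by number
console.log(mixedMatch[2]); // '12'
console.log(Object.keys(mixedMatch.groups)); // ['year']
```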

Backreferences  

\k<name> in a regular expression means: match the string that was previously matched by the named capture group name. For example:

const RE_TWICE = /^(?<word>[a-z]+)!\k<word>$/;
RE_TWICE.test('abc!abc'); // true
RE_TWICE.test('abc!ab'); // false

The backreference syntax for numbered capture groups works for named capture groups, too:

const RE_TWICE = /^(?<word>[a-z]+)!\1$/;
RE_TWICE.test('abc!abc'); // true
RE_TWICE.test('abc!ab'); // false

You can freely mix both syntaxes:

const RE_TWICE = /^(?<word>[a-z]+)!\k<word>!\1$/;
RE_TWICE.test('abc!abc!abc'); // true
RE_TWICE.test('abc!abc!ab'); // false

replace() and named capture groups  

The string method replace() supports named capture groups in two ways.

First, you can mention their names in the replacement string:

const RE_DATE = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
console.log('1999-12-31'.replace(RE_DATE,
    '$<month>/$<day>/$<year>'));
    // 12/31/1999

Second, each replacement function receives an additional, final parameter that holds an object with data captured via named groups. For example (line A):

const RE_DATE = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
console.log('1999-12-31'.replace(RE_DATE,
    (a,y,m,d, offset, input, {year, month, day}) => month+'/'+day+'/'+year)); // (A)
    // 12/31/1999

I’m not using template literals to keep things simpler, syntactically.

The first four parameters are numbered results:

  • a (index 0) contains the whole matched string, '1999-12-31'
  • y (index 1) contains the year
  • m (index 2) contains the month
  • d (index 3) contains the day

They are followed by two parameters that replacement functions have always received: offset is the index at which the match was found and input is the complete input string.

The last parameter is new and contains one property for each of the three named capture groups year, month and day. We use destructuring to access those properties.

The following code shows another way of accessing the last argument:

console.log('1999-12-31'.replace(RE_DATE,
    (...args) => {
        const {year, month, day} = args[args.length-1];
        return month+'/'+day+'/'+year;
    }));
    // 12/31/1999

We receive all arguments via the rest parameter args. The last element of the Array args is the object with the data from the named groups. We access it via the index args.length-1.
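One caveat with this approach: if a regular expression has no named groups, no groups object is passed and the last argument is the input string. The following sketch guards against that. The helper getGroups() is hypothetical and only exists for this example.

```javascript
// Hypothetical helper: returns the groups object that was passed to a
// replacement function, or undefined if there are no named groups.
function getGroups(args) {
    const last = args[args.length-1];
    return typeof last === 'object' ? last : undefined;
}

const RE_NAMED = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;
const withNames = '1999-12-31'.replace(RE_NAMED,
    (...args) => {
        const {year, month, day} = getGroups(args);
        return month+'/'+day+'/'+year;
    });
console.log(withNames); // 12/31/1999

// Without named groups, the last argument is the input string
let lastArgType;
'1999-12-31'.replace(/([0-9]{4})/,
    (...args) => {
        lastArgType = typeof args[args.length-1];
        return args[0];
    });
console.log(lastArgType); // string
```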

Named groups that don’t match  

If an optional named group does not match, its property is set to undefined (but still exists):

const RE_OPT_A = /^(?<as>a+)?$/;
const matchObj = RE_OPT_A.exec('');

// We have a match:
console.log(matchObj[0] === ''); // true

// Group <as> didn’t match anything:
console.log(matchObj.groups.as === undefined); // true

// But property `as` exists:
console.log('as' in matchObj.groups); // true
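Because the property is undefined in that case, destructuring defaults are one way of handling groups that didn’t match. A minimal sketch (the helper describeAs() and the default value '(none)' are made up for this example):

```javascript
const RE_OPT_A = /^(?<as>a+)?$/;

// Hypothetical helper: falls back to a default if group <as> didn't match
function describeAs(str) {
    const {groups: {as: letters = '(none)'}} = RE_OPT_A.exec(str);
    return letters;
}

console.log(describeAs('')); // '(none)'
console.log(describeAs('aaa')); // 'aaa'
```

Note that the default only helps if there is a match at all: exec() returns null otherwise and destructuring null throws an exception.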

Implementations  

The relevant V8 version is not yet in Node.js (7.10.0). You can check which V8 version your Node.js uses via:

node -p process.versions.v8

In Chrome Canary (60.0+), you can enable named capture groups as follows. First, look up the path of the Chrome Canary binary via the about: URL. Then start Canary like this (you only need the double quotes if the path contains a space):

$ alias canary='"/tmp/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary"'
$ canary --js-flags='--harmony-regexp-named-captures'
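If you want to find out at runtime whether an engine supports named capture groups, you could use a check along the following lines. This is only a sketch and supportsNamedGroups() is a hypothetical helper. It relies on the fact that engines without support reject the syntax (?<x>…) with an exception. The regular expression is created via the RegExp constructor so that the surrounding code still parses in such engines.

```javascript
// Hypothetical helper: detects support for named capture groups
function supportsNamedGroups() {
    try {
        const match = new RegExp('(?<x>a)').exec('a');
        return match !== null
            && match.groups !== undefined
            && match.groups.x === 'a';
    } catch (e) {
        // The engine could not parse the named group
        return false;
    }
}

console.log(supportsNamedGroups());
```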

Further reading