ECMAScript proposal: RegExp match indices

[2019-12-22] dev, javascript, es proposal

Update 2021-03-17: Important change: RegExp match indices must now explicitly be switched on via the RegExp flag /d (.hasIndices).
- RegExp match indices are supported by V8 v9.0 and later (source).

The ECMAScript proposal “RegExp match indices” (by Ron Buckton) adds more information to regular expression match objects (as returned by RegExp.prototype.exec() and other methods): They now record for each captured group where it starts and where it ends. Read on for more information.

Match objects

The following methods match regular expressions against strings and return match objects if they succeed:

RegExp.prototype.exec() returns null or a single match object.
String.prototype.match() returns null or a single match object (if flag /g is not set).
String.prototype.matchAll() returns an iterable of match objects (flag /g must be set; otherwise, an exception is thrown).

The key responsibility of a match object is to store the captures that were made by groups.
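
The three return shapes listed above can be sketched as follows. The regular expression and the input string are made up for illustration; only standard methods are used.

```javascript
const str = 'a1 b2';

// exec() returns a single match object (or null if there is no match)
const viaExec = /([a-z])([0-9])/.exec(str);
console.log(viaExec[0], viaExec.index); // a1 0

// match() without /g returns the same kind of match object
const viaMatch = str.match(/([a-z])([0-9])/);
console.log(viaMatch[1], viaMatch[2]); // a 1

// matchAll() needs /g and returns an iterable of match objects
const viaMatchAll = [...str.matchAll(/([a-z])([0-9])/g)];
console.log(viaMatchAll.map(m => m[0])); // [ 'a1', 'b2' ]

// Without /g, matchAll() throws a TypeError
let threw = false;
try {
  str.matchAll(/([a-z])([0-9])/);
} catch (err) {
  threw = err instanceof TypeError;
}
console.log(threw); // true

// If nothing matches, exec() and match() return null
console.log(/x/.exec(str), str.match(/x/)); // null null
```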

Numbered groups

This is how we access what the numbered groups of a regular expression captured:

const matchObj = /(a+)(b+)/d.exec('aaaabb');
assert.equal(
  matchObj[1], 'aaaa');
assert.equal(
  matchObj[2], 'bb');

Because the regular expression has the flag /d, matchObj also has the property .indices. It records the start index (inclusive) and the end index (exclusive) of each capture:

assert.deepEqual(
  matchObj.indices[1], [0, 4]);
assert.deepEqual(
  matchObj.indices[2], [4, 6]);
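
As the update at the top of this post notes, indices are only recorded if the flag /d is set. The following sketch compares a regular expression with and without that flag. It assumes an engine that supports /d (such as V8 v9.0 or later); the variable names are made up for illustration.

```javascript
const withFlag = /(a+)(b+)/d;
const withoutFlag = /(a+)(b+)/;

// The flag is reflected by the property .hasIndices
console.log(withFlag.hasIndices);    // true
console.log(withoutFlag.hasIndices); // false

// Only the regular expression with /d produces .indices
const m1 = withFlag.exec('aaaabb');
const m2 = withoutFlag.exec('aaaabb');
console.log(m1.indices[0]); // [ 0, 6 ]
console.log(m2.indices);    // undefined

// Each pair is [start, end) and can be passed directly to slice()
const [start, end] = m1.indices[2];
console.log('aaaabb'.slice(start, end) === m1[2]); // true
```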

Named groups

The captures of named groups are accessed like this:

const matchObj = /(?<as>a+)(?<bs>b+)/d.exec('aaaabb');
assert.equal(
  matchObj.groups.as, 'aaaa');
assert.equal(
  matchObj.groups.bs, 'bb');

Their indices are stored in matchObj.indices.groups:

assert.deepEqual(
  matchObj.indices.groups.as, [0, 4]);
assert.deepEqual(
  matchObj.indices.groups.bs, [4, 6]);
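
One detail worth sketching: a named group that did not take part in the match has neither a capture nor an index pair. The regular expression and the input below are made up for illustration, and the example assumes an engine that supports the flag /d.

```javascript
// The month part is optional and absent from the input
const re = /(?<year>[0-9]{4})(-(?<month>[0-9]{2}))?/d;
const input = 'Released 2021';
const match = re.exec(input);

const [yStart, yEnd] = match.indices.groups.year;
console.log(yStart, yEnd);              // 9 13
console.log(input.slice(yStart, yEnd)); // 2021

// Groups that did not participate in the match are undefined in both places
console.log(match.groups.month);         // undefined
console.log(match.indices.groups.month); // undefined
```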

Multiple match objects

String.prototype.matchAll() returns an iterable of match objects. If the flag /d is set, each one includes indices. Note that we are using named groups, but each named group also leads to a numbered capture.

assert.deepEqual(
  [...'aabbb aaaab'.matchAll(/(?<as>a+)(?<bs>b+)/gd)],
  [
    {
      0: 'aabbb',
      1: 'aa',
      2: 'bbb',
      index: 0,
      input: 'aabbb aaaab',
      groups: { as: 'aa', bs: 'bbb' },
      indices: {
        0: [0, 5],
        1: [0, 2],
        2: [2, 5],
        groups: { as: [0,2], bs: [2,5] },
      },
    },
    {
      0: 'aaaab',
      1: 'aaaa',
      2: 'b',
      index: 6,
      input: 'aabbb aaaab',
      groups: { as: 'aaaa', bs: 'b' },
      indices: {
        0: [6, 11],
        1: [6, 10],
        2: [10, 11],
        groups: { as: [6,10], bs: [10,11] },
      },
    },
  ]
);
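
In practice, we often want a flat list of positions rather than the full match objects. The following is a minimal sketch of that idea; listNamedSpans() is a hypothetical helper (not part of the proposal) and it assumes that the regular expression has both /g and /d.

```javascript
// Hypothetical helper: one record per named capture, across all matches
function listNamedSpans(str, regExp) {
  const spans = [];
  for (const match of str.matchAll(regExp)) {
    // .indices.groups is undefined if there are no named groups
    const groups = match.indices.groups ?? {};
    for (const [name, pair] of Object.entries(groups)) {
      if (pair === undefined) continue; // group did not participate
      const [start, end] = pair;
      spans.push({ name, start, end, text: str.slice(start, end) });
    }
  }
  return spans;
}

const spans = listNamedSpans('aabbb aaaab', /(?<as>a+)(?<bs>b+)/gd);
console.log(spans);
```

For the input above, this produces four records: two per match, in the order in which the groups appear in the regular expression.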

A realistic example

One important use case for match indices is parsers that need to point to exactly where a syntax error is located. The following code solves a related problem: It points to where quoted content starts and where it ends (see demonstration at the end).

const reQuoted = /“([^”]+)”/dgu;
function pointToQuotedText(str) {
  const startIndices = new Set();
  const endIndices = new Set();
  for (const match of str.matchAll(reQuoted)) {
    const [start, end] = match.indices[1];
    // Without match indices, we would have to compute the positions ourselves:
    // const [start, end] = [
    //   match.index+1,
    //   match.index+match[0].length-1,
    // ];
    startIndices.add(start);
    endIndices.add(end);
  }
  let result = '';
  for (let index=0; index < str.length; index++) {
    if (startIndices.has(index)) {
      result += '[';
    } else if (endIndices.has(index+1)) {
      result += ']';
    } else {
      result += ' ';
    }
  }
  return result;
}
assert.equal(
  pointToQuotedText(
    'They said “hello” and “goodbye”.'),
    '           [   ]       [     ]  '
);

Note that we do not handle Unicode graphemes with more than one code unit correctly. That is, this code only works when each printable glyph is represented by one code unit (JavaScript character).
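
To see why, consider quoted content that consists of a single emoji. The sketch below assumes an engine that supports the flag /d; the variable names are made up for illustration. Match indices count code units, so the emoji occupies two positions even though it is displayed as one glyph.

```javascript
const reQuotedDemo = /“([^”]+)”/du;
const m = reQuotedDemo.exec('“😀”');

// The emoji is one code point but two code units
console.log(m[1].length);      // 2
console.log([...m[1]].length); // 1

// The indices are measured in code units
console.log(m.indices[1]); // [ 1, 3 ]
```

Spreading a string only splits it into code points. A grapheme can consist of several code points, so counting glyphs reliably would need more than this.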

Support for this feature

The npm package regexp-match-indices provides a polyfill.
V8 issue: “Implement RegExp match offsets proposal”
- You can activate the experimental implementation in Node.js via the flag --harmony-regexp-match-indices.
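
Whether an engine supports the feature can be checked at runtime. The following is a minimal sketch; supportsMatchIndices() is a hypothetical helper, not part of the proposal or of the polyfill.

```javascript
// Hypothetical helper: detects whether the flag /d is available
function supportsMatchIndices() {
  try {
    // The RegExp constructor throws a SyntaxError for unknown flags
    const re = new RegExp('a', 'd');
    return re.hasIndices === true && Array.isArray(re.exec('a').indices);
  } catch {
    return false;
  }
}

console.log(supportsMatchIndices());
```

We use the RegExp constructor instead of a literal because an unsupported flag in a literal would be a syntax error for the whole file, which try-catch cannot intercept.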
