ECMAScript proposal: RegExp match indices

[2019-12-22] dev, javascript, es proposal
  • Update 2021-03-17: Important change: RegExp match indices must now explicitly be switched on via the RegExp flag /d (.hasIndices).
    • RegExp match indices are supported by V8 v9.0 and later (source).

The ECMAScript proposal “RegExp match indices” (by Ron Buckton) adds more information to regular expression match objects (as returned by RegExp.prototype.exec() and other methods): If the flag /d is set, they record for each captured group where it starts and where it ends.

Match objects  

The following methods match regular expressions against strings and return match objects if they succeed:

  • RegExp.prototype.exec() returns null or a single match object.
  • String.prototype.match() returns null or a single match object (if flag /g is not set).
  • String.prototype.matchAll() returns an iterable of match objects (flag /g must be set; otherwise, an exception is thrown).

The key responsibility of a match object is to store the captures that were made by groups.

Numbered groups  

This is how we access what the numbered groups of a regular expression captured:

const matchObj = /(a+)(b+)/d.exec('aaaabb');
assert.equal(
  matchObj[1], 'aaaa');
assert.equal(
  matchObj[2], 'bb');

If the flag /d is set, the proposal also gives us the start and end indices of what each group captured, via matchObj.indices. Start indices are inclusive, end indices are exclusive:

assert.deepEqual(
  matchObj.indices[1], [0, 4]);
assert.deepEqual(
  matchObj.indices[2], [4, 6]);

Named groups  

The captures of named groups are accessed like this:

const matchObj = /(?<as>a+)(?<bs>b+)/d.exec('aaaabb');
assert.equal(
  matchObj.groups.as, 'aaaa');
assert.equal(
  matchObj.groups.bs, 'bb');

Their indices are stored in matchObj.indices.groups:

assert.deepEqual(
  matchObj.indices.groups.as, [0, 4]);
assert.deepEqual(
  matchObj.indices.groups.bs, [4, 6]);

Multiple match objects  

String.prototype.matchAll() returns an iterable of match objects. If the flag /d is set, each one includes the property .indices. Note that we are using named groups, but each named group also leads to a numbered capture.

assert.deepEqual(
  [...'aabbb aaaab'.matchAll(/(?<as>a+)(?<bs>b+)/dg)],
  [
    {
      0: 'aabbb',
      1: 'aa',
      2: 'bbb',
      index: 0,
      input: 'aabbb aaaab',
      groups: { as: 'aa', bs: 'bbb' },
      indices: {
        0: [0, 5],
        1: [0, 2],
        2: [2, 5],
        groups: { as: [0,2], bs: [2,5] },
      },
    },
    {
      0: 'aaaab',
      1: 'aaaa',
      2: 'b',
      index: 6,
      input: 'aabbb aaaab',
      groups: { as: 'aaaa', bs: 'b' },
      indices: {
        0: [6, 11],
        1: [6, 10],
        2: [10, 11],
        groups: { as: [6,10], bs: [10,11] },
      },
    },
  ]
);

A realistic example  

One important use case for match indices is parsers that point to where exactly a syntax error is located. The following code solves a related problem: It points to where quoted content starts and where it ends (see demonstration at the end).

const reQuoted = /“([^”]+)”/dgu;
function pointToQuotedText(str) {
  const startIndices = new Set();
  const endIndices = new Set();
  for (const match of str.matchAll(reQuoted)) {
    const [start, end] = match.indices[1];
    // Without proposal:
    // const [start, end] = [
    //   match.index+1,
    //   match.index+match[0].length-1,
    // ]; 
    startIndices.add(start);
    endIndices.add(end);
  }
  let result = '';
  for (let index=0; index < str.length; index++) {
    if (startIndices.has(index)) {
      result += '[';
    } else if (endIndices.has(index+1)) {
      result += ']';
    } else {
      result += ' ';
    }
  }
  return result;
}
assert.equal(
  pointToQuotedText(
    'They said “hello” and “goodbye”.'),
    '           [   ]       [     ]  '
);

Note that we do not handle Unicode graphemes with more than one code unit correctly. That is, this code only works when each printable glyph is represented by one code unit (JavaScript character).

Support for this feature  

Further reading