JavaScript regular expressions: /g, /y, and .lastIndex

[2020-01-01] dev, javascript, regexp
  • Update 2020-01-02: Restructured the content to make the introduction easier to understand.

In this blog post, we examine how the RegExp flags /g and /y work and how they depend on the RegExp property .lastIndex. We’ll also discover an interesting use case for .lastIndex that you may not have considered yet.

The flags /g and /y  

These flags can be summarized as follows:

  • /g (.global) activates multi-match modes for several regular expression operations.
  • /y (.sticky) is similar to /g, but there can’t be gaps between matches.

The following two regular expression operations completely ignore /g and /y:

  • String.prototype.search(regExp)
  • String.prototype.split(regExp)

All other operations are affected by them in some way.

Flag /g (.global)  

Let’s see what the multi-match modes look like.

.exec() and /g  

Without /g, .exec() always returns a match object for the first match:

> const re = /#/;
> re.exec('##-#')
{ 0: '#', index: 0, input: '##-#' }
> re.exec('##-#')
{ 0: '#', index: 0, input: '##-#' }

With /g, it returns one new match per invocation and null when there are no more matches:

> const re = /#/g;
> re.exec('##-#')
{ 0: '#', index: 0, input: '##-#' }
> re.exec('##-#')
{ 0: '#', index: 1, input: '##-#' }
> re.exec('##-#')
{ 0: '#', index: 3, input: '##-#' }
> re.exec('##-#')
null

.replace() and /g  

Without /g, .replace() only replaces the first match:

> '##-#'.replace(/#/, 'x')
'x#-#'

With /g, .replace() replaces all matches:

> '##-#'.replace(/#/g, 'x')
'xx-x'

.matchAll() and /g  

.matchAll() only works if /g is set and returns all match objects:

> const re = /#/g;
> [...'##-#'.matchAll(re)]
[
  { 0: '#', index: 0, input: '##-#' },
  { 0: '#', index: 1, input: '##-#' },
  { 0: '#', index: 3, input: '##-#' },
]

Flag /y (.sticky)  

We will use /y together with /g for now (think “/g without gaps”). To understand /y on its own, we’ll need to learn about the RegExp property .lastIndex, which we’ll get to soon.

.exec() and /gy  

With /gy, each match returned by .exec() must immediately follow the previous match. That’s why it only returns two matches in the following example:

> const re = /#/gy;
> re.exec('##-#')
{ 0: '#', index: 0, input: '##-#' }
> re.exec('##-#')
{ 0: '#', index: 1, input: '##-#' }
> re.exec('##-#')
null

.replace() and /gy  

With /gy, .replace() replaces all matches, as long as there are no gaps between them:

> '##-#'.replace(/#/gy, 'x')
'xx-#'

.matchAll() and /gy  

With /gy, .matchAll() returns match objects for adjacent matches only:

> const re = /#/gy;
> [...'##-#'.matchAll(re)]
[
  { 0: '#', index: 0, input: '##-#' },
  { 0: '#', index: 1, input: '##-#' },
]

The regular expression property .lastIndex  

The regular expression property .lastIndex only has an effect if at least one of the flags /g and /y is used.

For regular expression operations that are affected by it, it controls where matching starts.

.lastIndex and /g  

For example, .exec() uses .lastIndex to remember where it currently is in the input string:

> const re = /[a-z]/g;
> re.lastIndex
0
> [re.exec('a-b'), re.lastIndex]
[{ 0: 'a', index: 0, input: 'a-b' }, 1]
> [re.exec('a-b'), re.lastIndex]
[{ 0: 'b', index: 2, input: 'a-b' }, 3]
> [re.exec('a-b'), re.lastIndex]
[ null, 0 ]

.matchAll() honors .lastIndex but does not change it:

> const re = /#/g; re.lastIndex = 1;
> [...'##-#'.matchAll(re)]
[
  { 0: '#', index: 1, input: '##-#' },
  { 0: '#', index: 3, input: '##-#' },
]
> re.lastIndex
1

.replace() ignores .lastIndex and sets it to zero:

> const re = /#/g; re.lastIndex = 1;
> '##-#'.replace(re, 'x')
'xx-x'
> re.lastIndex
0

To summarize, for several operations, /g means: Match at .lastIndex or later.

.lastIndex and /y  

For /y, .lastIndex means: Match at exactly .lastIndex. It works as if the beginning of the regular expression were anchored to .lastIndex.

Note that ^ and $ continue to work as usual: They anchor matches to the beginning or end of the input string, unless .multiline is set. Then they anchor to the beginnings or ends of lines.
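The interplay between ^ and /y can be sketched as follows. This is only an illustration and the variable names are made up for it: with /y alone, ^ still refers to the start of the input, so a match attempt at a later .lastIndex fails. If /m is also set, ^ can succeed at the start of a later line.

```javascript
// Without /m, ^ only matches at index 0, so /y cannot succeed at index 1.
const stickyRe = /^#/y;
stickyRe.lastIndex = 1;
const resultAtOne = stickyRe.exec('##');
console.log(resultAtOne); // null

// With /m, ^ also matches right after a line break (here: index 2).
const stickyMultilineRe = /^#/my;
stickyMultilineRe.lastIndex = 2;
const resultAtLineStart = stickyMultilineRe.exec('#\n#');
console.log(resultAtLineStart.index); // 2
```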

.exec() matches multiple times if /y is set (even if /g is not set):

> const re = /#/y; re.lastIndex = 1;
> [re.exec('##-#'), re.lastIndex]
[{0: '#', index: 1, input: '##-#'}, 2]
> [re.exec('##-#'), re.lastIndex]
[ null, 0 ]

If /y is used without /g, then .replace() replaces the first occurrence that is found (directly) at .lastIndex. It updates .lastIndex.

> const re = /#/y; re.lastIndex = 1;
> ['##-#'.replace(re, 'x'), re.lastIndex]
[ '#x-#', 2 ]
> ['##-#'.replace(re, 'x'), re.lastIndex] // no match
[ '##-#', 0 ]
> ['##-#'.replace(re, 'x'), re.lastIndex]
[ 'x#-#', 1 ]

Pitfalls of /g and /y  

Pitfall: We can’t inline a regular expression with /g or /y  

A regular expression with /g can’t be inlined. For example, in the following while loop, the regular expression is created fresh, every time the condition is checked. Therefore, its .lastIndex is always zero and the loop never terminates.

let matchObj;
// Infinite loop
while (matchObj = /a+/g.exec('bbbaabaaa')) {
  console.log(matchObj[0]);
}

With /y, the problem is the same.
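One possible fix, sketched here with illustrative names: store the regular expression in a variable so that its .lastIndex survives from one iteration to the next.

```javascript
// The regular expression is created once, so .lastIndex advances between calls.
const regExp = /a+/g;
const found = [];
let matchObj;
while ((matchObj = regExp.exec('bbbaabaaa')) !== null) {
  found.push(matchObj[0]);
}
console.log(found); // [ 'aa', 'aaa' ]
```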

Pitfall: Removing /g or /y can break code  

If code expects a regular expression with /g and has a loop over the results of .exec() or .test(), then a regular expression without /g can cause an infinite loop:

function collectMatches(regExp, str) {
  const matches = [];
  let matchObj;
  // Infinite loop
  while (matchObj = regExp.exec(str)) {
    matches.push(matchObj[0]);
  }
  return matches;
}
collectMatches(/a+/, 'bbbaabaaa'); // Missing: flag /g

Why is there an infinite loop? Because .exec() always returns the first result, a match object, and never null.

With /y, the problem is the same.

Pitfall: Adding /g or /y can break code  

With .test(), there is another caveat: It is affected by .lastIndex. Therefore, if we want to check exactly once if a regular expression matches a string, then the regular expression must not have /g. Otherwise, we generally get a different result every time we call .test():

> const regExp = /^X/g;
> [regExp.test('Xa'), regExp.lastIndex]
[ true, 1 ]
> [regExp.test('Xa'), regExp.lastIndex]
[ false, 0 ]
> [regExp.test('Xa'), regExp.lastIndex]
[ true, 1 ]

The first invocation produces a match and updates .lastIndex. The second invocation does not find a match and resets .lastIndex to zero.

If we create a regular expression specifically for .test(), then we probably won’t add /g. However, the likelihood of encountering /g increases if we use the same regular expression for replacing and for testing.

Once again, this problem also exists with /y:

> const regExp = /^X/y;
> regExp.test('Xa')
true
> regExp.test('Xa')
false
> regExp.test('Xa')
true
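The following sketch shows how sharing one regular expression between replacing and testing can go wrong. The names letterA, replaced and filtered are only illustrative.

```javascript
// One regular expression, shared between replacing and testing.
const letterA = /a/g;

// For replacing, /g is exactly what we want.
const replaced = 'a-a'.replace(letterA, 'x');
console.log(replaced); // 'x-x'

// .test() keeps state in .lastIndex across calls, so 'a2' is wrongly rejected.
const filtered = ['a1', 'a2', 'a3'].filter((str) => letterA.test(str));
console.log(filtered); // [ 'a1', 'a3' ]
```

Every string in the array contains an “a”, but the second call to .test() starts at index 1 and therefore fails.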

Pitfall: Code can produce unexpected results if .lastIndex isn’t zero  

Because so many regular expression operations are affected by .lastIndex, many algorithms must make sure that .lastIndex is zero at the beginning. Otherwise, we may get unexpected results:

function countMatches(regExp, str) {
  let count = 0;
  while (regExp.test(str)) {
    count++;
  }
  return count;
}

const myRegExp = /a/g;
myRegExp.lastIndex = 4;
assert.equal(
  countMatches(myRegExp, 'babaa'), 1); // should be 3

Normally, .lastIndex is zero in newly created regular expressions and we won’t change it explicitly like we did in the example. But .lastIndex can still end up not being zero if we use the regular expression multiple times.
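A small sketch of how that can happen without any explicit assignment. It repeats countMatches() from above so that it can be run on its own, and sharedRegExp is an illustrative name: an earlier .test() call leaves .lastIndex at a non-zero value, and the later count is off.

```javascript
function countMatches(regExp, str) {
  let count = 0;
  while (regExp.test(str)) {
    count++;
  }
  return count;
}

const sharedRegExp = /a/g;

// A successful .test() leaves .lastIndex right after the match.
const firstCheck = sharedRegExp.test('xa');
console.log(firstCheck, sharedRegExp.lastIndex); // true 2

// Counting now starts at index 2 and misses the 'a' at index 0.
const count = countMatches(sharedRegExp, 'aba');
console.log(count); // 1 (should be 2)
```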

Measures for avoiding the pitfalls of /g, /y, and .lastIndex  

As an example of dealing with /g and .lastIndex, we revisit countMatches() from the previous example. How do we prevent a wrong regular expression from breaking our code? Let’s look at three approaches.

Throwing exceptions  

First, we can throw an exception if /g isn’t set or .lastIndex isn’t zero:

function countMatches(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  if (regExp.lastIndex !== 0) {
    throw new Error('regExp.lastIndex must be zero');
  }
  
  let count = 0;
  while (regExp.test(str)) {
    count++;
  }
  return count;
}

Cloning regular expressions  

Second, we can clone the parameter. That has the added benefit that regExp won’t be changed.

function countMatches(regExp, str) {
  const cloneFlags = regExp.flags + (regExp.global ? '' : 'g');
  const clone = new RegExp(regExp, cloneFlags);

  let count = 0;
  while (clone.test(str)) {
    count++;
  }
  return count;
}
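A usage sketch for the cloning approach. It repeats the function so that it can be run on its own, and the names withoutG and dirty are illustrative: the function works without /g and with a non-zero .lastIndex, and it leaves the argument untouched.

```javascript
function countMatches(regExp, str) {
  // Add /g if it is missing; the clone always starts with .lastIndex === 0.
  const cloneFlags = regExp.flags + (regExp.global ? '' : 'g');
  const clone = new RegExp(regExp, cloneFlags);

  let count = 0;
  while (clone.test(str)) {
    count++;
  }
  return count;
}

// A regular expression without /g no longer causes an infinite loop.
const withoutG = /a/;
console.log(countMatches(withoutG, 'babaa')); // 3

// A non-zero .lastIndex is neither used nor changed.
const dirty = /a/g;
dirty.lastIndex = 4;
console.log(countMatches(dirty, 'babaa')); // 3
console.log(dirty.lastIndex); // 4
```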

Using an operation that isn’t affected by .lastIndex or flags  

Several regular expression operations are not affected by .lastIndex or by flags. For example, .match() ignores .lastIndex if /g is present:

function countMatches(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  return (str.match(regExp) || []).length;
}

const myRegExp = /a/g;
myRegExp.lastIndex = 4;
assert.equal(countMatches(myRegExp, 'babaa'), 3); // OK!

Here, countMatches() works even though we didn’t check or fix .lastIndex.

Use case for .lastIndex: starting matching at a given index  

Apart from storing state, .lastIndex can also be used to start matching at a given index. This section describes how.

Example: Checking if a regular expression matches at a given index  

Given that .test() is affected by /y and .lastIndex, we can use it to check if a regular expression regExp matches a string str at a given index:

function matchesStringAt(regExp, str, index) {
  if (!regExp.sticky) {
    throw new Error('Flag /y of regExp must be set');
  }
  regExp.lastIndex = index;
  return regExp.test(str);
}
assert.equal(
  matchesStringAt(/x+/y, 'aaxxx', 0), false);
assert.equal(
  matchesStringAt(/x+/y, 'aaxxx', 2), true);

regExp is anchored to .lastIndex due to /y.

Note that we must not use the assertion ^, which would anchor regExp to the beginning of the input string.

Example: Finding the location of a match, starting at a given index  

.search() lets us find the location where a regular expression matches:

> '#--#'.search(/#/)
0

Alas, we can’t change where .search() starts looking for matches. As a work-around, we can use .exec() for searching:

function searchAt(regExp, str, index) {
  if (!regExp.global && !regExp.sticky) {
    throw new Error('Either flag /g or flag /y of regExp must be set');
  }
  regExp.lastIndex = index;
  const match = regExp.exec(str);
  if (match) {
    return match.index;
  } else {
    return -1;
  }
}

assert.equal(
  searchAt(/#/g, '#--#', 0), 0);
assert.equal(
  searchAt(/#/g, '#--#', 1), 3);
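Since searchAt() also accepts /y, the difference between the two flags can be sketched like this. The function is repeated in condensed form so that the example can be run on its own: /g searches at the index or later, /y only at exactly that index.

```javascript
function searchAt(regExp, str, index) {
  if (!regExp.global && !regExp.sticky) {
    throw new Error('Either flag /g or flag /y of regExp must be set');
  }
  regExp.lastIndex = index;
  const match = regExp.exec(str);
  return match ? match.index : -1;
}

// /g: the next match at index 1 or later is at index 3.
console.log(searchAt(/#/g, '#--#', 1)); // 3

// /y: there is no match at exactly index 1.
console.log(searchAt(/#/y, '#--#', 1)); // -1

// /y: there is a match at exactly index 3.
console.log(searchAt(/#/y, '#--#', 3)); // 3
```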

Example: Replacing an occurrence at a given index  

When used without /g and with /y, .replace() makes one replacement – if there is a match at .lastIndex:

function replaceOnceAt(str, regExp, replacement, index) {
  if (!(regExp.sticky && !regExp.global)) {
    throw new Error('Flag /y must be set, flag /g must not be set');
  }
  regExp.lastIndex = index;
  return str.replace(regExp, replacement);
}
assert.equal(
  replaceOnceAt('aa aaaa a', /a+/y, 'X', 0), 'X aaaa a');
assert.equal(
  replaceOnceAt('aa aaaa a', /a+/y, 'X', 3), 'aa X a');
assert.equal(
  replaceOnceAt('aa aaaa a', /a+/y, 'X', 8), 'aa aaaa X');

Summary: .global (/g) and .sticky (/y)  

The following two methods are completely unaffected by /g and /y:

  • String.prototype.search()
  • String.prototype.split()

Flag /g                       #                       .lI       Flag /yg
.exec(): null¦MObj            0+  at .lI or later     ✓ upd.    same as /y
.test(): boolean              0+  at .lI or later     ✓ upd.    same as /y
.replace(): string            1   all occurrences     ✗ reset   /g w/o gaps
.replaceAll(): string         1   (same as .replace)  ✗ reset   /g w/o gaps
.match(): null¦Array          1                       ✗ reset   /g w/o gaps
.matchAll(): Iterable<MObj>   1   at .lI or later     ✓ unch.   /g w/o gaps

Legend:

  • Column “#” specifies how many results a method delivers.
  • .lI means .lastIndex
  • MObj means MatchObject
  • Column “.lI”: Is the operation affected by .lastIndex?
    • ✓ Yes: .lastIndex is either updated or unchanged.
    • ✗ No: By default, .lastIndex isn’t touched, but several operations reset it to zero.

Flag /y         #   Result                          .lI
.exec()         0+  null¦MObj  at .lI               ✓ updated
.test()         0+  boolean    at .lI               ✓ updated
.replace()      1   string     occurrence at .lI    ✓ updated
.replaceAll()       TypeError
.match()        0+  null¦MObj  (same as .exec())    ✓ updated
.matchAll()         TypeError

Automatically generated result table  

I have written a small Node.js script that prints the following result table:

const s='##-#';

const r=/#/g; r.lastIndex=1;
r.exec(s)             .index=1       .lastIndex updated  
r.test(s)             true           .lastIndex updated  
s.replace(r, 'x')     "xx-x"         .lastIndex reset    
s.replaceAll(r, 'x')  "xx-x"         .lastIndex reset    
s.match(r)            ["#","#","#"]  .lastIndex reset    
s.matchAll(r)         [["#"],["#"]]  .lastIndex unchanged

const r=/#/y; r.lastIndex=1;
r.exec(s)             .index=1   .lastIndex updated  
r.test(s)             true       .lastIndex updated  
s.replace(r, 'x')     "#x-#"     .lastIndex updated  
s.replaceAll(r, 'x')  TypeError
s.match(r)            .index=1   .lastIndex updated  
s.matchAll(r)         TypeError

const r=/#/yg; r.lastIndex=1;
r.exec(s)             .index=1   .lastIndex updated  
r.test(s)             true       .lastIndex updated  
s.replace(r, 'x')     "xx-#"     .lastIndex reset    
s.replaceAll(r, 'x')  "xx-#"     .lastIndex reset    
s.match(r)            ["#","#"]  .lastIndex reset    
s.matchAll(r)         [["#"]]    .lastIndex unchanged

(Older versions of .matchAll() don’t throw a TypeError if /g is missing.)
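A table like this can be produced along the following lines. This is a sketch, not the original script: the names showMatch, operations and probe are made up for this illustration, the classification of .lastIndex is a simple heuristic, and the code assumes an engine that supports .replaceAll() and .matchAll().

```javascript
const s = '##-#';

// Formats a match object (or null) in the style of the table above.
const showMatch = (m) => (m ? `.index=${m.index}` : 'null');

// Each entry: a label plus a function that runs the operation
// and returns a printable result.
const operations = [
  ['r.exec(s)', (r) => showMatch(r.exec(s))],
  ['r.test(s)', (r) => String(r.test(s))],
  ["s.replace(r, 'x')", (r) => JSON.stringify(s.replace(r, 'x'))],
  ["s.replaceAll(r, 'x')", (r) => JSON.stringify(s.replaceAll(r, 'x'))],
  ['s.match(r)', (r) => {
    const m = s.match(r);
    return r.global ? JSON.stringify(m) : showMatch(m);
  }],
  ['s.matchAll(r)', (r) => JSON.stringify([...s.matchAll(r)])],
];

// Runs every operation on a fresh regular expression whose .lastIndex is 1.
function probe(flags) {
  return operations.map(([label, run]) => {
    const r = new RegExp('#', flags);
    r.lastIndex = 1;
    try {
      const result = run(r);
      // Heuristic: 1 means untouched, 0 means reset, anything else was updated.
      const state = r.lastIndex === 1 ? 'unchanged'
        : r.lastIndex === 0 ? 'reset' : 'updated';
      return `${label.padEnd(22)}${result.padEnd(15)}.lastIndex ${state}`;
    } catch (err) {
      // For example, .replaceAll() throws a TypeError if /g is missing.
      return `${label.padEnd(22)}${err.name}`;
    }
  });
}

for (const flags of ['g', 'y', 'yg']) {
  console.log(`const r=/#/${flags}; r.lastIndex=1;`);
  console.log(probe(flags).join('\n') + '\n');
}
```

Creating a fresh regular expression per operation keeps the rows independent of each other.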

Conclusion  

The regular expression property .lastIndex has two significant downsides:

  • It makes regular expressions stateful:
    • We now have to be mindful of the states of regular expressions and how we share them.
    • For many use cases, we can’t make them immutable via freezing, either.
  • Support for .lastIndex is inconsistent among regular expression operations.
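Why freezing often doesn’t help can be sketched as follows, with illustrative variable names and assuming an engine that follows the current ECMAScript specification: with /g or /y, .exec() must write to .lastIndex, which fails for a frozen object.

```javascript
// With /g, a successful .exec() has to update .lastIndex.
const frozenGlobal = Object.freeze(/a/g);
let thrownError = null;
try {
  frozenGlobal.exec('a');
} catch (err) {
  thrownError = err;
}
console.log(thrownError instanceof TypeError); // true

// Without /g and /y, a successful match does not write to .lastIndex.
const frozenPlain = Object.freeze(/a/);
console.log(frozenPlain.test('a')); // true
```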

On the upside, .lastIndex also gives us additional useful functionality: We can dictate where matching should begin (for some operations).

Further reading