Regular expressions in JavaScript: lookaround assertions by example

[2019-10-27] dev, javascript, regexp

These are lookaround assertions in regular expressions in JavaScript:

  • Positive lookahead: (?=«pattern»)
  • Negative lookahead: (?!«pattern»)
  • Positive lookbehind: (?<=«pattern»)
  • Negative lookbehind: (?<!«pattern»)

This blog post shows examples of using them.

Cheat sheet: lookaround assertions  

At the current location in the input string:

  • Lookahead assertions (ECMAScript 3):
    • Positive lookahead: (?=«pattern») matches if pattern matches what comes after the current location.
    • Negative lookahead: (?!«pattern») matches if pattern does not match what comes after the current location.
  • Lookbehind assertions (ECMAScript 2018):
    • Positive lookbehind: (?<=«pattern») matches if pattern matches what comes before the current location.
    • Negative lookbehind: (?<!«pattern») matches if pattern does not match what comes before the current location.
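As a quick illustration of the four kinds, the following sketch replaces the letter a with X depending on whether a b is next to it. The input strings and variable names are made up for this illustration.

```javascript
// Lookahead: inspect what comes after the current location
const posAhead = 'ab ac'.replace(/a(?=b)/g, 'X'); // 'a' followed by 'b'
const negAhead = 'ab ac'.replace(/a(?!b)/g, 'X'); // 'a' not followed by 'b'

// Lookbehind: inspect what comes before the current location
const posBehind = 'ba ca'.replace(/(?<=b)a/g, 'X'); // 'a' preceded by 'b'
const negBehind = 'ba ca'.replace(/(?<!b)a/g, 'X'); // 'a' not preceded by 'b'

console.log(posAhead); // Xb ac
console.log(negAhead); // ab Xc
console.log(posBehind); // bX ca
console.log(negBehind); // ba cX
```

In all four cases, the b itself is never replaced: the assertions only check the neighborhood of the a.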

For more information, see “JavaScript for impatient programmers”: lookahead assertions, lookbehind assertions.

A word of caution about regular expressions  

Regular expressions are a double-edged sword: powerful and short, but also sloppy and cryptic. Sometimes a different, longer approach (such as a proper parser) is the better choice, especially for production code.

Another caveat is that lookbehind assertions are a relatively new feature that may not be supported by all JavaScript engines you are targeting.
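One way to guard against missing support is to test for it at runtime. The following is a minimal sketch; supportsLookbehind() is a hypothetical helper name, not a built-in function. It relies on engines without lookbehind support rejecting the pattern when it is compiled.

```javascript
// Hypothetical helper: checks whether the engine can compile a lookbehind
function supportsLookbehind() {
  try {
    // The RegExp constructor compiles the pattern at runtime. A regular
    // expression literal would already fail while the file is being parsed.
    new RegExp('(?<=a)b');
    return true;
  } catch (err) {
    return false;
  }
}

console.log(supportsLookbehind());
```

Code that depends on lookbehind assertions could then be created via new RegExp() only if the check succeeds, with a fallback such as the filtering approach otherwise.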

Example: Specifying what comes before or after a match (positive lookaround)  

In the following interaction, we extract quoted words:

> 'how "are" "you" doing'.match(/(?<=")[a-z]+(?=")/g)
[ 'are', 'you' ]

Two lookaround assertions help us here:

  • (?<=") “must be preceded by a quote”
  • (?=") “must be followed by a quote”

Lookaround assertions are especially convenient for .match() in its /g mode, which returns whole matches (capture group 0). Whatever the pattern of a lookaround assertion matches is not captured. Without lookaround assertions, the quotes show up in the result:

> 'how "are" "you" doing'.match(/"([a-z]+)"/g)
[ '"are"', '"you"' ]
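If capture groups are preferred over lookaround assertions, one possible alternative is .matchAll() (ECMAScript 2020), which gives access to the groups of each match. This is only a sketch for comparison; the variable names are made up.

```javascript
const str = 'how "are" "you" doing';

// Each match object m has the whole match in m[0] and group 1 in m[1]
const quotedWords = Array.from(str.matchAll(/"([a-z]+)"/g), (m) => m[1]);

console.log(quotedWords); // [ 'are', 'you' ]
```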

Example: Specifying what does not come before or after a match (negative lookaround)  

How can we achieve the opposite of what we did in the previous section and extract all unquoted words from a string?

  • Input: 'how "are" "you" doing'
  • Output: ['how', 'doing']

Our first attempt is to simply convert positive lookaround assertions to negative lookaround assertions. Alas, that fails:

> 'how "are" "you" doing'.match(/(?<!")[a-z]+(?!")/g)
[ 'how', 'r', 'o', 'doing' ]

The problem is that we extract sequences of characters that are not bracketed by quotes. That means that in the string '"are"', the “r” in the middle is considered unquoted, because it is preceded by an “a” and followed by an “e”.

We can fix this by stating that prefix and suffix must be neither quote nor letter:

> 'how "are" "you" doing'.match(/(?<!["a-z])[a-z]+(?!["a-z])/g)
[ 'how', 'doing' ]

Another solution is to demand via \b that the sequence of characters [a-z]+ start and end at word boundaries:

> 'how "are" "you" doing'.match(/(?<!")\b[a-z]+\b(?!")/g)
[ 'how', 'doing' ]

One thing that is nice about negative lookbehind and negative lookahead is that they also work at the beginning or end, respectively, of a string – as demonstrated in the example.
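To see why that matters, the following sketch compares the negative assertions with a seemingly equivalent pair of positive assertions that demand a non-quote character. The positive version needs an actual character before and after the word and therefore fails at the boundaries of the string. The input 'how' is made up for this illustration.

```javascript
const input = 'how';

// Negative lookaround: "no quote here" also holds where there is no character
const viaNegative = input.match(/(?<!")\b[a-z]+\b(?!")/g);

// Positive lookaround with a negated class: requires a character to exist
const viaPositive = input.match(/(?<=[^"])\b[a-z]+\b(?=[^"])/g);

console.log(viaNegative); // [ 'how' ]
console.log(viaPositive); // null
```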

There are no simple alternatives to negative lookaround assertions  

Negative lookaround assertions are a powerful tool and difficult to emulate via other (regular expression) means.

If you don’t want to use them, you normally have to take a completely different approach. For example, in this case, you could split the string into (quoted and unquoted) words and then filter those:

import assert from 'node:assert/strict';

const str = 'how "are" "you" doing';

const allWords = str.match(/"?[a-z]+"?/g);
const unquotedWords = allWords.filter(w => !w.startsWith('"') || !w.endsWith('"'));
assert.deepEqual(unquotedWords, ['how', 'doing']);

Benefits of this approach:

  • It works on older engines.
  • It is easy to understand.

Interlude: pointing lookaround assertions inward  

All of the examples we have seen so far have in common that the lookaround assertions dictate what must come before or after the match but without including those characters in the match.

The regular expressions shown in the remainder of this blog post are different: Their lookaround assertions point inward and restrict what’s inside the match.

Example: match strings not starting with 'abc'  

Let’s assume we want to match all strings that do not start with 'abc'. Our first attempt could be the regular expression /^(?!abc)/.

That works well for .test():

> /^(?!abc)/.test('xyz')
true

However, .exec() gives us an empty string:

> /^(?!abc)/.exec('xyz')
{ 0: '', index: 0, input: 'xyz', groups: undefined }

The problem is that assertions such as lookaround assertions don’t expand the matched text. That is, they don’t capture input characters, they only make demands about the current location in the input.

Therefore, the solution is to add a pattern that does capture input characters:

> /^(?!abc).*$/.exec('xyz')
{ 0: 'xyz', index: 0, input: 'xyz', groups: undefined }

As desired, this new regular expression rejects strings that are prefixed with 'abc':

> /^(?!abc).*$/.exec('abc')
null
> /^(?!abc).*$/.exec('abcd')
null

And it accepts strings that don’t have the full prefix:

> /^(?!abc).*$/.exec('ab')
{ 0: 'ab', index: 0, input: 'ab', groups: undefined }
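As a usage sketch, the regular expression can filter an Array of strings. The candidate strings and variable names are made up; the comparison with .startsWith() is only there to show that, for these inputs, both approaches agree.

```javascript
const RE_NOT_ABC = /^(?!abc).*$/;
const candidates = ['abc', 'abcd', 'ab', 'xyz', ''];

const accepted = candidates.filter((s) => RE_NOT_ABC.test(s));
console.log(accepted); // [ 'ab', 'xyz', '' ]

// For strings without line breaks, this plain check gives the same result
const viaStartsWith = candidates.filter((s) => !s.startsWith('abc'));
console.log(viaStartsWith); // [ 'ab', 'xyz', '' ]
```

Note that . does not match line terminators. Without the flag /s, this regular expression therefore also rejects strings that contain line breaks.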

Example: match import statements whose module specifier does not end with '.mjs'  

In the following example, we want to find

import ··· from '«module-specifier»';

where module-specifier does not end with '.mjs'.

import assert from 'node:assert/strict';

const code = `
import {transform} from './util';
import {Person} from './person.mjs';
import {zip} from 'lodash';
`.trim();
assert.deepEqual(
  code.match(/^import .*? from '[^']+(?<!\.mjs)';$/umg),
  [
    "import {transform} from './util';",
    "import {zip} from 'lodash';",
  ]);

Here, the lookbehind assertion (?<!\.mjs) acts as a guard and prevents the regular expression from matching strings that contain '.mjs' at this location.
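If only the module specifiers are needed, a capture group can be wrapped around the specifier and its guard. The following is a sketch under that assumption; RE_IMPORT and the other names are made up for this illustration.

```javascript
const code = [
  "import {transform} from './util';",
  "import {Person} from './person.mjs';",
  "import {zip} from 'lodash';",
].join('\n');

// Group 1 contains the specifier; the guard still rejects a '.mjs' suffix
const RE_IMPORT = /^import .*? from '([^']+(?<!\.mjs))';$/umg;

const specifiers = Array.from(code.matchAll(RE_IMPORT), (m) => m[1]);
console.log(specifiers); // [ './util', 'lodash' ]
```

The guard stays inside the group because it is zero-width: it does not add characters to what group 1 captures.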

Example: skipping lines with comments  

Scenario: We want to parse lines with settings, while skipping comments. For example:

const RE_SETTING = /^(?!#)([^:]*):(.*)$/;

const lines = [
  'indent: 2', // setting
  '# Trim trailing whitespace:', // comment
  'whitespace: trim', // setting
];
for (const line of lines) {
  const match = RE_SETTING.exec(line);
  if (match) {
    const key = JSON.stringify(match[1]);
    const value = JSON.stringify(match[2]);
    console.log(`KEY: ${key} VALUE: ${value}`);
  }
}

// Output:
// 'KEY: "indent" VALUE: " 2"'
// 'KEY: "whitespace" VALUE: " trim"'

How did we arrive at the regular expression RE_SETTING?

We started with the following regular expression for settings:

/^([^:]*):(.*)$/

Intuitively, it is a sequence of the following parts:

  • Start of the line
  • Non-colons (zero or more)
  • A single colon
  • Any characters (zero or more)
  • The end of line

This regular expression does reject some comments:

> /^([^:]*):(.*)$/.test('# Comment')
false

But it accepts others (that have colons in them):

> /^([^:]*):(.*)$/.test('# Comment:')
true

We can fix that by prefixing (?!#) as a guard. Intuitively, it means: “The current location in the input string must not be followed by the character #.”

The new regular expression works as desired:

> /^(?!#)([^:]*):(.*)$/.test('# Comment:')
false
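Building on this regular expression, the following sketch collects the settings of a multi-line text in a Map. parseSettings() is a hypothetical helper; trimming keys and values is an assumption about the desired result, not something the regular expression does.

```javascript
const RE_SETTING = /^(?!#)([^:]*):(.*)$/;

// Hypothetical helper: returns a Map of settings, skipping comment lines
function parseSettings(text) {
  const settings = new Map();
  for (const line of text.split(/\r?\n/)) {
    const match = RE_SETTING.exec(line);
    if (match) {
      // Assumption: surrounding whitespace is not significant
      settings.set(match[1].trim(), match[2].trim());
    }
  }
  return settings;
}

const settings = parseSettings([
  'indent: 2',
  '# Trim trailing whitespace:',
  'whitespace: trim',
].join('\n'));

console.log(settings.get('indent')); // 2
console.log(settings.get('whitespace')); // trim
console.log(settings.size); // 2
```

The text is split into lines first because [^:]* also matches line terminators. Applying the regular expression to the whole text with the flag /m could therefore merge a line without a colon with the line after it.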

Example: smart quotes  

Let’s assume we want to convert pairs of straight double quotes to curly quotes:

  • Input: `"yes" and "no"`
  • Output: `“yes” and “no”`

This is our first attempt:

> `The words "must" and "should".`.replace(/"(.*)"/g, '“$1”')
'The words “must" and "should”.'

Only the first quote and the last quote are curly. The problem here is that the * quantifier matches greedily (as much as possible).

If we put a question mark after the *, it matches reluctantly:

> `The words "must" and "should".`.replace(/"(.*?)"/g, '“$1”')
'The words “must” and “should”.'
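Another common way to avoid the greedy match, shown here only as a sketch for comparison, is a negated character class: [^"]* can never run past the closing quote, so it does not matter whether the quantifier is greedy or reluctant.

```javascript
const text = 'The words "must" and "should".';

// [^"]* matches any characters except straight double quotes
const curly = text.replace(/"([^"]*)"/g, '“$1”');

console.log(curly); // The words “must” and “should”.
```

Unlike .*?, the negated class also matches line breaks, so quoted text may span multiple lines.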

Supporting escaping via backslashes  

What if we want to allow the escaping of quotes via backslashes? We can do that by using the guard (?<!\\) before the quotes:

> String.raw`\"straight\" and "curly"`.replace(/(?<!\\)"(.*?)(?<!\\)"/g, '“$1”')
'\\"straight\\" and “curly”'

As a post-processing step, we would still need to remove the backslashes in front of the escaped quotes:

.replace(/\\"/g, `"`)

However, this regular expression can fail when there is a backslash-escaped backslash:

> String.raw`Backslash: "\\"`.replace(/(?<!\\)"(.*?)(?<!\\)"/g, '“$1”')
'Backslash: "\\\\"'

The second backslash prevented the quotes from becoming curly.

We can fix that if we make our guard more sophisticated (?: makes the group non-capturing):

(?<=[^\\](?:\\\\)*)

(Credit: @jonasraoni)

The new guard allows pairs of backslashes before quotes:

> String.raw`Backslash: "\\"`.replace(/(?<=[^\\](?:\\\\)*)"(.*?)(?<=[^\\](?:\\\\)*)"/g, '“$1”')
'Backslash: “\\\\”'

One issue remains. This guard prevents the first quote from being matched if it appears at the beginning of a string:

> `"abc"`.replace(/(?<=[^\\](?:\\\\)*)"(.*?)(?<=[^\\](?:\\\\)*)"/g, '“$1”')
'"abc"'

We can fix that by changing the first guard to: (?<=[^\\](?:\\\\)*|^)

> `"abc"`.replace(/(?<=[^\\](?:\\\\)*|^)"(.*?)(?<=[^\\](?:\\\\)*)"/g, '“$1”')
'“abc”'
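Putting the pieces together, the final regular expression and the post-processing step could be wrapped in a function. This is a minimal sketch: curlify() and RE_QUOTED are hypothetical names, and the second step only unescapes quotes (escaped backslashes are left as they are).

```javascript
// Opening guard: an even number of backslashes after a non-backslash, or
// the start of the string. Closing guard: the same, without the start.
const RE_QUOTED = /(?<=[^\\](?:\\\\)*|^)"(.*?)(?<=[^\\](?:\\\\)*)"/g;

// Hypothetical helper: converts unescaped quote pairs, then unescapes \"
function curlify(str) {
  return str
    .replace(RE_QUOTED, '“$1”')
    .replace(/\\"/g, '"');
}

console.log(curlify('"abc"')); // “abc”
console.log(curlify(String.raw`\"straight\" and "curly"`)); // "straight" and “curly”
console.log(curlify(String.raw`Backslash: "\\"`)); // Backslash: “\\”
```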

Further reading