These are lookaround assertions in regular expressions in JavaScript:
(?=«pattern»)
(?!«pattern»)
(?<=«pattern»)
(?<!«pattern»)
This blog post shows examples of using them.
At the current location in the input string:
(?=«pattern») matches if pattern matches what comes after the current location.
(?!«pattern») matches if pattern does not match what comes after the current location.
(?<=«pattern») matches if pattern matches what comes before the current location.
(?<!«pattern») matches if pattern does not match what comes before the current location.
For more information, see “JavaScript for impatient programmers”: lookahead assertions, lookbehind assertions.
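As a minimal sketch of the four assertions, the following regular expressions all match the single letter b and only differ in what they demand of its neighbors. The variable names are made up for this illustration.

```javascript
// Each assertion checks the surroundings of 'b' without consuming them.
const followedByC = /b(?=c)/;      // positive lookahead
const notFollowedByC = /b(?!c)/;   // negative lookahead
const precededByA = /(?<=a)b/;     // positive lookbehind
const notPrecededByA = /(?<!a)b/;  // negative lookbehind

console.log(followedByC.test('abc'));     // true
console.log(notFollowedByC.test('abc'));  // false
console.log(precededByA.test('abc'));     // true
console.log(notPrecededByA.test('abc'));  // false
console.log(notPrecededByA.test('xbc'));  // true

// The matched text is only 'b'; the neighbors are not part of the match.
console.log(followedByC.exec('abc')[0]);  // 'b'
```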
Regular expressions are a double-edged sword: powerful and short, but also sloppy and cryptic. Sometimes different, longer approaches (such as proper parsers) may be better, especially for production code.
Another caveat is that lookbehind assertions are a relatively new feature that may not be supported by all JavaScript engines you are targeting.
In the following interaction, we extract quoted words:
> 'how "are" "you" doing'.match(/(?<=")[a-z]+(?=")/g)
[ 'are', 'you' ]
Two lookaround assertions help us here:
(?<=") “must be preceded by a quote”
(?=") “must be followed by a quote”
Lookaround assertions are especially convenient for .match() in its /g mode, which returns whole matches (capture group 0). Whatever the pattern of a lookaround assertion matches is not captured. Without lookaround assertions, the quotes show up in the result:
> 'how "are" "you" doing'.match(/"([a-z]+)"/g)
[ '"are"', '"you"' ]
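If lookbehind is not available, one possible alternative is to keep the quotes in the pattern and read capture group 1 of each match via .matchAll(). This is only a sketch; the variable names are chosen for this illustration.

```javascript
const str = 'how "are" "you" doing';

// The quotes are part of each whole match (group 0),
// but group 1 contains only the word between them.
const quotedWords = Array.from(
  str.matchAll(/"([a-z]+)"/g),
  (match) => match[1]
);

console.log(quotedWords); // [ 'are', 'you' ]
```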
How can we achieve the opposite of what we did in the previous section and extract all unquoted words from a string?
Input: 'how "are" "you" doing'
Output: ['how', 'doing']
Our first attempt is to simply convert positive lookaround assertions to negative lookaround assertions. Alas, that fails:
> 'how "are" "you" doing'.match(/(?<!")[a-z]+(?!")/g)
[ 'how', 'r', 'o', 'doing' ]
The problem is that we extract sequences of characters that are not bracketed by quotes. That means that in the string '"are"', the “r” in the middle is considered unquoted, because it is preceded by an “a” and followed by an “e”.
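To see where the unwanted matches come from, we can log the index of each match of the failed attempt. The following is a small sketch using .matchAll(); the property names text and index of the result objects are chosen for this illustration.

```javascript
const input = 'how "are" "you" doing';

// Record each match together with its position in the input.
const matches = Array.from(
  input.matchAll(/(?<!")[a-z]+(?!")/g),
  (match) => ({ text: match[0], index: match.index })
);

console.log(matches);
// [
//   { text: 'how', index: 0 },
//   { text: 'r', index: 6 },     // inside "are"
//   { text: 'o', index: 12 },    // inside "you"
//   { text: 'doing', index: 16 }
// ]
```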
We can fix this by stating that prefix and suffix must be neither quote nor letter:
> 'how "are" "you" doing'.match(/(?<!["a-z])[a-z]+(?!["a-z])/g)
[ 'how', 'doing' ]
Another solution is to demand via \b that the sequence of characters [a-z]+ start and end at word boundaries:
> 'how "are" "you" doing'.match(/(?<!")\b[a-z]+\b(?!")/g)
[ 'how', 'doing' ]
One thing that is nice about negative lookbehind and negative lookahead is that they also work at the beginning or end, respectively, of a string – as demonstrated in the example.
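The difference between negative and positive assertions at the edges of a string can be checked directly. In this sketch, the input 'abc' has nothing before or after it:

```javascript
// Negative assertions succeed when there is nothing to look at.
console.log(/(?<!")abc/.test('abc')); // true: no quote before the start
console.log(/abc(?!")/.test('abc'));  // true: no quote after the end

// Positive assertions need an actual character and therefore fail.
console.log(/(?<=")abc/.test('abc')); // false
console.log(/abc(?=")/.test('abc'));  // false
```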
Negative lookaround assertions are a powerful tool and difficult to emulate via other (regular expression) means.
If you don’t want to use them, you normally have to take a completely different approach. For example, in this case, you could split the string into (quoted and unquoted) words and then filter those:
const str = 'how "are" "you" doing';
const allWords = str.match(/"?[a-z]+"?/g);
const unquotedWords = allWords.filter(w => !w.startsWith('"') || !w.endsWith('"'));
assert.deepEqual(unquotedWords, ['how', 'doing']);
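A benefit of this approach is that it does not need lookbehind assertions and therefore also works on JavaScript engines that do not support them.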
All of the examples we have seen so far have in common that the lookaround assertions dictate what must come before or after the match but without including those characters in the match.
The regular expressions shown in the remainder of this blog post are different: Their lookaround assertions point inward and restrict what’s inside the match.
Let’s assume we want to match all strings that do not start with 'abc'. Our first attempt could be the regular expression /^(?!abc)/.
That works well for .test():
> /^(?!abc)/.test('xyz')
true
However, .exec() gives us an empty string:
> /^(?!abc)/.exec('xyz')
{ 0: '', index: 0, input: 'xyz', groups: undefined }
The problem is that assertions such as lookaround assertions don’t expand the matched text. That is, they don’t capture input characters, they only make demands about the current location in the input.
Therefore, the solution is to add a pattern that does capture input characters:
> /^(?!abc).*$/.exec('xyz')
{ 0: 'xyz', index: 0, input: 'xyz', groups: undefined }
As desired, this new regular expression rejects strings that are prefixed with 'abc':
> /^(?!abc).*$/.exec('abc')
null
> /^(?!abc).*$/.exec('abcd')
null
And it accepts strings that don’t have the full prefix:
> /^(?!abc).*$/.exec('ab')
{ 0: 'ab', index: 0, input: 'ab', groups: undefined }
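One detail worth keeping in mind: by default, the dot does not match line terminators, so /^(?!abc).*$/ rejects every input that contains a line break. The sketch below assumes that multi-line inputs should be accepted as well and adds the /s (dotAll) flag for that purpose. The constant names are made up for this illustration.

```javascript
const RE_NOT_ABC = /^(?!abc).*$/;
const RE_NOT_ABC_DOTALL = /^(?!abc).*$/s;

console.log(RE_NOT_ABC.test('xyz'));  // true
console.log(RE_NOT_ABC.test('abcd')); // false

// Without /s, .* stops at the line break and $ cannot match there.
console.log(RE_NOT_ABC.test('x\ny')); // false

// With /s, the dot also matches line terminators.
console.log(RE_NOT_ABC_DOTALL.test('x\ny'));   // true
console.log(RE_NOT_ABC_DOTALL.test('abc\ny')); // false
```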
In the following example, we want to find
import ··· from '«module-specifier»';
where module-specifier does not end with '.mjs'.
const code = `
import {transform} from './util';
import {Person} from './person.mjs';
import {zip} from 'lodash';
`.trim();
assert.deepEqual(
  code.match(/^import .*? from '[^']+(?<!\.mjs)';$/umg),
  [
    "import {transform} from './util';",
    "import {zip} from 'lodash';",
  ]);
Here, the lookbehind assertion (?<!\.mjs) acts as a guard and prevents the regular expression from matching strings that contain '.mjs' at this location.
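For comparison, here is a sketch of how the same result could be obtained without a lookbehind assertion: the module specifier is captured in group 1 and checked with .endsWith() afterwards. The names RE_IMPORT and nonMjsImports are made up for this illustration.

```javascript
const code = [
  "import {transform} from './util';",
  "import {Person} from './person.mjs';",
  "import {zip} from 'lodash';",
].join('\n');

// Group 1 captures the module specifier between the single quotes.
const RE_IMPORT = /^import .*? from '([^']+)';$/gmu;

const nonMjsImports = Array.from(code.matchAll(RE_IMPORT))
  .filter((match) => !match[1].endsWith('.mjs')) // the guard, as plain code
  .map((match) => match[0]);

console.log(nonMjsImports);
// [ "import {transform} from './util';", "import {zip} from 'lodash';" ]
```

This version is longer, but the condition lives in ordinary code rather than inside the pattern.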
Scenario: We want to parse lines with settings, while skipping comments. For example:
const RE_SETTING = /^(?!#)([^:]*):(.*)$/
const lines = [
  'indent: 2', // setting
  '# Trim trailing whitespace:', // comment
  'whitespace: trim', // setting
];
for (const line of lines) {
  const match = RE_SETTING.exec(line);
  if (match) {
    const key = JSON.stringify(match[1]);
    const value = JSON.stringify(match[2]);
    console.log(`KEY: ${key} VALUE: ${value}`);
  }
}
// Output:
// 'KEY: "indent" VALUE: " 2"'
// 'KEY: "whitespace" VALUE: " trim"'
How did we arrive at the regular expression RE_SETTING?
We started with the following regular expression for settings:
/^([^:]*):(.*)$/
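Intuitively, it is a sequence of the following parts: the start of the line (^), zero or more characters that are not colons ([^:]*), a single colon (:), zero or more arbitrary characters (.*), and the end of the line ($).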
This regular expression does reject some comments:
> /^([^:]*):(.*)$/.test('# Comment')
false
But it accepts others (that have colons in them):
> /^([^:]*):(.*)$/.test('# Comment:')
true
We can fix that by prefixing (?!#) as a guard. Intuitively, it means: “The current location in the input string must not be followed by the character #.”
The new regular expression works as desired:
> /^(?!#)([^:]*):(.*)$/.test('# Comment:')
false
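The guarded regular expression can be wrapped in a small helper that collects all settings. The following is only a sketch: parseSettings is a hypothetical function, and trimming keys and values is an assumption about how the settings should be interpreted.

```javascript
const RE_SETTING = /^(?!#)([^:]*):(.*)$/;

// Hypothetical helper: collects key-value pairs and skips comment lines.
function parseSettings(lines) {
  const settings = new Map();
  for (const line of lines) {
    const match = RE_SETTING.exec(line);
    if (match) {
      // Assumption: whitespace around keys and values is not significant.
      settings.set(match[1].trim(), match[2].trim());
    }
  }
  return settings;
}

const settings = parseSettings([
  'indent: 2',
  '# Trim trailing whitespace:',
  'whitespace: trim',
]);
console.log(settings.get('indent'));     // '2'
console.log(settings.get('whitespace')); // 'trim'
console.log(settings.size);              // 2
```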
Let’s assume we want to convert pairs of straight double quotes to curly quotes:
Input: `"yes" and "no"`
Output: `“yes” and “no”`
This is our first attempt:
> `The words "must" and "should".`.replace(/"(.*)"/g, '“$1”')
'The words “must" and "should”.'
Only the first quote and the last quote are curly. The problem here is that the * quantifier matches greedily (as much as possible).
If we put a question mark after the *, it matches reluctantly:
> `The words "must" and "should".`.replace(/"(.*?)"/g, '“$1”')
'The words “must” and “should”.'
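The difference between the two quantifiers becomes visible if we look at what capture group 1 contains for the first match. A brief sketch with made-up variable names:

```javascript
const text = 'The words "must" and "should".';

// Greedy: .* extends as far as possible, up to the last quote.
const greedy = text.match(/"(.*)"/)[1];

// Reluctant: .*? stops at the first quote that completes the match.
const reluctant = text.match(/"(.*?)"/)[1];

console.log(greedy);    // must" and "should
console.log(reluctant); // must
```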
What if we want to allow the escaping of quotes via backslashes? We can do that by using the guard (?<!\\) before the quotes:
> String.raw`\"straight\" and "curly"`.replace(/(?<!\\)"(.*?)(?<!\\)"/g, '“$1”')
'\\"straight\\" and “curly”'
As a post-processing step, we would still need to do:
.replace(/\\"/g, `"`)
However, this regular expression can fail when there is a backslash-escaped backslash:
> String.raw`Backslash: "\\"`.replace(/(?<!\\)"(.*?)(?<!\\)"/g, '“$1”')
'Backslash: "\\\\"'
The second backslash prevented the quotes from becoming curly.
We can fix that by making our guard more sophisticated (the ?: makes the group non-capturing):
(?<=[^\\](?:\\\\)*)
(Credit: @jonasraoni)
The new guard allows pairs of backslashes before quotes:
> String.raw`Backslash: "\\"`.replace(/(?<=[^\\](?:\\\\)*)"(.*?)(?<=[^\\](?:\\\\)*)"/g, '“$1”')
'Backslash: “\\\\”'
One issue remains. This guard prevents the first quote from being matched if it appears at the beginning of a string:
> `"abc"`.replace(/(?<=[^\\](?:\\\\)*)"(.*?)(?<=[^\\](?:\\\\)*)"/g, '“$1”')
'"abc"'
We can fix that by changing the first guard to: (?<=[^\\](?:\\\\)*|^)
> `"abc"`.replace(/(?<=[^\\](?:\\\\)*|^)"(.*?)(?<=[^\\](?:\\\\)*)"/g, '“$1”')
'“abc”'
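Putting the pieces together, the final regular expression and the post-processing step mentioned earlier could be combined into one function. This is a sketch, not a complete solution: curlifyQuotes and RE_QUOTED are made-up names, and the function assumes that quotes are always properly paired.

```javascript
// First guard: the opening quote is at the start of the string or
// preceded by a non-backslash plus zero or more pairs of backslashes.
// Second guard: same condition for the closing quote, without the ^ case.
const RE_QUOTED = /(?<=[^\\](?:\\\\)*|^)"(.*?)(?<=[^\\](?:\\\\)*)"/g;

// Hypothetical helper combining both steps.
function curlifyQuotes(str) {
  return str
    .replace(RE_QUOTED, '“$1”') // unescaped pairs become curly
    .replace(/\\"/g, '"');      // post-processing: unescape \" to "
}

console.log(curlifyQuotes('"abc"'));
// “abc”
console.log(curlifyQuotes('The words "must" and "should".'));
// The words “must” and “should”.
console.log(curlifyQuotes(String.raw`\"straight\" and "curly"`));
// "straight" and “curly”
```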
Further reading: chapter “Regular expressions (RegExp)” in “JavaScript for impatient programmers”