The ECMAScript proposal “RegExp escaping” (by Jordan Harband and Kevin Gibbons) specifies a function RegExp.escape() that, given a string text, creates an escaped version that matches text – if interpreted as a regular expression.
This proposal is currently at stage 3.
How does RegExp.escape() work?
For a string text, RegExp.escape(text) creates a regular expression pattern that matches text.
Characters that have special meaning in regular expressions can’t be used verbatim and have to be escaped:
> RegExp.escape('(*)')
'\\(\\*\\)'
Note that we see each regular expression backslash twice: One of them is the actual backslash, the other one escapes it inside the string literal:
> '\\(\\*\\)' === String.raw`\(\*\)`
true
Characters that have no special meaning don’t have to be escaped:
> RegExp.escape('_abc123')
'_abc123'
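To see why escaping matters, the following sketch builds regular expressions from plain text by hand, without calling RegExp.escape() (which may not be available in every engine yet). The escaped pattern is the one shown above; the sample strings are made up for illustration.

```javascript
// Unescaped: '.' is regular expression syntax and matches any character.
const unescaped = new RegExp('^a.c$');
console.log(unescaped.test('a.c')); // true
console.log(unescaped.test('abc')); // true (probably not intended)

// Unescaped: '(*)' is not even a valid pattern.
let errorName = null;
try {
  new RegExp('(*)');
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // 'SyntaxError'

// Escaped: the output of RegExp.escape('(*)') shown above.
const escaped = new RegExp('^' + '\\(\\*\\)' + '$');
console.log(escaped.test('(*)')); // true
```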
What are the use cases for RegExp.escape()?
The classic use case for escaping was searching and replacing text:
import assert from 'node:assert/strict';

// Replace every occurrence of the plain text `searchText` in `str`.
function replacePlainText(str, searchText, replace) {
  const searchRegExp = new RegExp(
    RegExp.escape(searchText),
    'gu'
  );
  return str.replace(searchRegExp, replace);
}
assert.equal(
  replacePlainText('(a) and (a)', '(a)', '@'),
  '@ and @'
);
However, since ES2021, we have .replaceAll():
assert.equal(
  '(a) and (a)'.replaceAll('(a)', '@'),
  '@ and @'
);
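Escaping is still useful when the plain text is only one part of a larger pattern, e.g. when we only want to replace text that is not wrapped in quotes:
function removeUnquotedText(str, text) {
  const regExp = new RegExp(
    `(?<!“)${RegExp.escape(text)}(?!”)`,
    'gu'
  );
  return str.replaceAll(regExp, '•');
}
assert.equal(
  removeUnquotedText('“yes” and yes and “yes”', 'yes'),
  '“yes” and • and “yes”'
);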
The same approach can also be used to find or count unquoted text.
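As a rough illustration of counting, here is a minimal sketch. The names escapeText and countUnquotedText are hypothetical. escapeText uses RegExp.escape() where the engine provides it and otherwise falls back to a simple backslash-based approximation, which is an assumption for this example and not the algorithm from the proposal.

```javascript
// Hypothetical helper: prefers RegExp.escape(), otherwise backslash-escapes
// top-level syntax characters (an approximation, not the proposal's algorithm).
function escapeText(text) {
  if (typeof RegExp.escape === 'function') {
    return RegExp.escape(text);
  }
  return text.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');
}

// Counts occurrences of `text` that are not wrapped in curly quotes.
function countUnquotedText(str, text) {
  const regExp = new RegExp(
    `(?<!“)${escapeText(text)}(?!”)`,
    'gu'
  );
  return Array.from(str.matchAll(regExp)).length;
}

console.log(countUnquotedText('“yes” and yes, yes and “yes”', 'yes')); // 2
console.log(countUnquotedText('“(a)” or (a)', '(a)')); // 1
```

Finding works the same way: iterating over str.matchAll(regExp) yields match objects whose .index property gives the position of each unquoted occurrence.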
Any given pattern returned by RegExp.escape() may exist for a long time. Therefore, it is important that future regular expression features don’t prevent the pattern from working. That’s why RegExp.escape() doesn’t just escape punctuation characters that are in use today as special syntax, it also escapes characters that may become syntax in the future.
Furthermore, escaped text should always work: No matter which flags are active and no matter where it is inserted. We’ll examine next how that influences the output of RegExp.escape().
One interesting example is the upcoming flag /x, which ignores unescaped whitespace. Therefore, whitespace must be escaped:
> RegExp.escape(' \t')
'\\x20\\t'
We want escaped characters to be as short as possible. Alas, we can’t use any features that are enabled via the flags /u and /v. That leaves us with:
- The following escapes take care of some whitespace and line terminator characters (we’ll see soon why the latter have to be escaped):
  \t \n \v \f \r
- For Unicode code points up to 0xFF, we can use an (ASCII) hex escape – e.g.: \x41 matches A.
- For Unicode code points up to 0xFFFF, we can use Unicode code unit escapes – e.g.: \u2028 matches the Unicode character LINE SEPARATOR.
- For higher code points, we can’t use code point escapes such as \u{1F44D} because those are only supported with flag /u or /v. We have to use two code unit escapes. For now that’s not necessary but in the future, we may have to escape characters outside the Basic Multilingual Plane.
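The escapes listed above can be tried out directly. The following sketch only uses regular expression literals; the variable names are made up for illustration.

```javascript
// All of these escapes work without any flags.
console.log(/^\t$/.test('\t'));         // true (control escape)
console.log(/^\x41$/.test('A'));        // true (hex escape)
console.log(/^\u2028$/.test('\u2028')); // true (code unit escape)

const thumbsUp = '\u{1F44D}'; // a code point outside the Basic Multilingual Plane

// Without /u or /v, \u{1F44D} is not a code point escape,
// so this pattern does not match the character.
const codePointEscape = /^\u{1F44D}$/;
console.log(codePointEscape.test(thumbsUp)); // false

// Two code unit escapes (a surrogate pair) work with and without /u.
const withoutFlag = /^\uD83D\uDC4D$/;
const withFlagU = /^\uD83D\uDC4D$/u;
console.log(withoutFlag.test(thumbsUp)); // true
console.log(withFlagU.test(thumbsUp));   // true
```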
In a regular expression, there can be many “contexts” (think nested scopes) – e.g.:
- At the top level, characters such as * and $ have to be escaped if we want to match them.
- Inside a character class (such as [abc]):
  - Characters such as * and $ don’t have to be escaped.
  - However, we do have to escape characters such as - (hyphen).
  - With flag /v, several double punctuators have to be escaped – e.g. && and --. That is done by escaping both characters.
  - \q{} is yet another context. Inside a character class, it adds one or more sequences of code points to the class.
Consequences for escaping:
The upcoming flag /x supports line comments via # (i.e., a new context). Therefore, line terminators must be escaped (actual newline becomes escaped newline):
> RegExp.escape('\n')
'\\n'
The following characters are RegExp top-level syntax and can be escaped with a backslash:
^ $ \ . * + ? ( ) [ ] { } |
Example:
> RegExp.escape('$')
'\\$'
Other punctuation characters are only syntax in some contexts – either now or, potentially, in the future. However, most of them can’t be escaped with a backslash if the flags include /u or /v. Therefore, escaping uses hex escapes in these cases (which are shorter than Unicode code unit escapes):
, - = < > # & ! % : ; @ ~ ' ` "
Example:
> RegExp.escape('=>')
'\\x3d\\x3e'
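The difference between backslash escapes and hex escapes can be checked with a small experiment. In this sketch, the patterns are written out by hand and mirror the output described above; isValidPattern is a hypothetical helper introduced only for this example.

```javascript
// Returns true if `pattern` can be compiled with the given flags.
function isValidPattern(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch {
    return false;
  }
}

// A backslash before '=' is tolerated without flags but rejected with /u.
console.log(isValidPattern('\\=', ''));  // true
console.log(isValidPattern('\\=', 'u')); // false

// The hex escapes work in both modes.
console.log(new RegExp('^\\x3d\\x3e$').test('=>'));      // true
console.log(new RegExp('^\\x3d\\x3e$', 'u').test('=>')); // true

// Top-level syntax characters can be backslash-escaped in both modes.
console.log(isValidPattern('\\$', 'u')); // true

// Inside a character class, a hex-escaped hyphen does not create a range.
const hyphen = '\\x2d'; // hex escape for '-', following the rule above
const charClass = new RegExp('^[a' + hyphen + 'c]$', 'u');
console.log(charClass.test('-')); // true
console.log(charClass.test('b')); // false
```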
Some regular expressions are constructed like this:
new RegExp('<regex pattern>' + RegExp.escape(text))
We don’t want the result of RegExp.escape() to affect the regular expression pattern that comes before it:
- \0 represents the NULL character (U+0000) and must not be followed by a decimal digit. That’s why initial decimal digits are escaped:
  > RegExp.escape('123')
  '\\x3123'
- \1, \2, etc. are backreferences to numbered capture groups. An escaped text should not add decimal digits to them – which is taken care of by escaping initial decimal digits (see previous item).
- Control characters can be represented like this: \cA (Ctrl-A), ..., \cZ (Ctrl-Z). However, \c can also be used on its own – in which case it is interpreted verbatim. Therefore, an escaped text must not start with an ASCII letter:
  > RegExp.escape('abc')
  '\\x61bc'
- Flags /u and /v handle surrogate pairs as units. Therefore, we must escape lone surrogates so that they are not combined with preceding or succeeding lone surrogates in a regular expression pattern.
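The rules described in this section can be approximated in a few lines. The following is a simplified sketch, not the algorithm from the proposal: the name sketchEscape and its helpers are hypothetical, the character lists are the ones given in this text, and details may differ from what real engines do.

```javascript
// Characters this text lists as top-level syntax (backslash escape).
const SYNTAX_CHARS = new Set('^$\\.*+?()[]{}|');
// Other punctuators listed in this text (hex escape).
const OTHER_PUNCTUATORS = new Set(',-=<>#&!%:;@~\'`"');
// Whitespace and line terminators that have short escapes.
const CONTROL_ESCAPES = new Map([
  ['\t', '\\t'], ['\n', '\\n'], ['\v', '\\v'], ['\f', '\\f'], ['\r', '\\r'],
]);

// \xHH for code units up to 0xFF, \uHHHH otherwise.
function numericEscape(codeUnit) {
  const hex = codeUnit.toString(16);
  return codeUnit <= 0xFF
    ? '\\x' + hex.padStart(2, '0')
    : '\\u' + hex.padStart(4, '0');
}

// True if the code unit at index i is a surrogate without a partner.
function isLoneSurrogate(str, i) {
  const code = str.charCodeAt(i);
  if (code >= 0xD800 && code <= 0xDBFF) {
    const next = str.charCodeAt(i + 1); // NaN at the end of the string
    return !(next >= 0xDC00 && next <= 0xDFFF);
  }
  if (code >= 0xDC00 && code <= 0xDFFF) {
    const prev = str.charCodeAt(i - 1); // NaN at the start of the string
    return !(prev >= 0xD800 && prev <= 0xDBFF);
  }
  return false;
}

function sketchEscape(text) {
  let result = '';
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    const code = text.charCodeAt(i);
    if (i === 0 && /[0-9A-Za-z]/.test(ch)) {
      // Initial digits and ASCII letters must not merge with a preceding \0, \1 or \c.
      result += numericEscape(code);
    } else if (SYNTAX_CHARS.has(ch)) {
      result += '\\' + ch;
    } else if (CONTROL_ESCAPES.has(ch)) {
      result += CONTROL_ESCAPES.get(ch);
    } else if (OTHER_PUNCTUATORS.has(ch) || /\s/.test(ch) || isLoneSurrogate(text, i)) {
      result += numericEscape(code);
    } else {
      result += ch;
    }
  }
  return result;
}

console.log(sketchEscape('(*)')); // \(\*\)
console.log(sketchEscape('123')); // \x3123
console.log(sketchEscape('abc')); // \x61bc

// The escaped text does not merge with the pattern that precedes it.
const unsafe = new RegExp('^(a)\\1' + '0' + '$');
const safe = new RegExp('^(a)\\1' + sketchEscape('0') + '$');
console.log(unsafe.test('aa0')); // false: \10 is read as a single escape
console.log(safe.test('aa0'));   // true
```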