ECMAScript proposal: RegExp escaping

[2025-01-21] dev, javascript, es proposal

The ECMAScript proposal “RegExp escaping” (by Jordan Harband and Kevin Gibbons) specifies a function RegExp.escape() that, given a string text, creates an escaped version that matches text – if interpreted as a regular expression.

This proposal is currently at stage 3.

How does RegExp.escape() work?  

For a string text, RegExp.escape(text) creates a regular expression pattern that matches text.

Characters that have special meaning in regular expressions can’t be used verbatim and have to be escaped:

> RegExp.escape('(*)')
'\\(\\*\\)'

Note that we see each regular expression backslash twice: One of them is the actual backslash, the other one escapes it inside the string literal:

> '\\(\\*\\)' === String.raw`\(\*\)`
true

Characters that have no special meaning don’t have to be escaped:

> RegExp.escape('_abc123')
'_abc123'
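To see why escaping matters, the following sketch uses text as a pattern without escaping it. The escaped pattern is written out by hand (copied from the output shown earlier), so the snippet does not depend on RegExp.escape() being available in your engine:

```javascript
// Unescaped, '(*)' is not even a valid pattern: there is nothing for * to repeat
let error = null;
try {
  new RegExp('(*)');
} catch (err) {
  error = err;
}
console.log(error instanceof SyntaxError); // true

// The escaped version (copied from the RegExp.escape() output) matches the text
const escapedPattern = String.raw`\(\*\)`;
console.log(new RegExp(escapedPattern).test('(*)')); // true

// An unescaped dot is a wildcard; an escaped dot only matches a dot
const unescapedDot = new RegExp('a.c');
const escapedDot = new RegExp(String.raw`a\.c`);
console.log(unescapedDot.test('abc')); // true
console.log(escapedDot.test('abc')); // false
console.log(escapedDot.test('a.c')); // true
```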

What are the use cases for RegExp.escape()?  

Example: replacing all occurrences of a text  

The classic use case for escaping was searching and replacing text:

function replacePlainText(str, searchText, replace) {
  const searchRegExp = new RegExp(
    RegExp.escape(searchText),
    'gu'
  );
  return str.replace(searchRegExp, replace);
}
assert.equal(
  replacePlainText('(a) and (a)', '(a)', '@'),
  '@ and @'
);

However, since ES2021, we have .replaceAll():

assert.equal(
  '(a) and (a)'.replaceAll('(a)', '@'),
  '@ and @'
);

Example: part of a regular expression must match a given text  

function removeUnquotedText(str, text) {
  const regExp = new RegExp(
    `(?<!“)${RegExp.escape(text)}(?!”)`,
    'gu'
  );
  return str.replaceAll(regExp, '•');
}
assert.equal(
  removeUnquotedText('“yes” and yes and “yes”', 'yes'),
  '“yes” and • and “yes”'
);

The same approach can also be used to find or count unquoted text.
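As a rough sketch of counting, the hypothetical helper countUnquotedText() below reuses the same pattern. The name escapeText and its fallback are assumptions made for illustration: if RegExp.escape() is not available, a simplified stand-in is used that only escapes today's top-level syntax characters. It is not a full polyfill.

```javascript
// Assumption: fall back to a simplified stand-in if RegExp.escape() is missing.
// The stand-in only escapes top-level syntax characters.
const escapeText = typeof RegExp.escape === 'function'
  ? RegExp.escape
  : (text) => text.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');

// Hypothetical helper: counts occurrences of `text` that are not
// wrapped in curly quotes
function countUnquotedText(str, text) {
  const regExp = new RegExp(
    `(?<!“)${escapeText(text)}(?!”)`,
    'gu'
  );
  return Array.from(str.matchAll(regExp)).length;
}

console.log(countUnquotedText('“yes” and yes and “yes”', 'yes')); // 1
console.log(countUnquotedText('(a) and (a)', '(a)')); // 2
```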

Considerations for escaping  

Any given pattern returned by RegExp.escape() may exist for a long time. Therefore, it is important that future regular expression features don’t prevent the pattern from working. That’s why RegExp.escape() doesn’t just escape punctuation characters that are in use today as special syntax; it also escapes characters that may become syntax in the future.

Furthermore, escaped text should always work: No matter which flags are active and no matter where it is inserted. We’ll examine next how that influences the output of RegExp.escape().

Escaping must work for all flags  

One interesting example is the upcoming flag /x, which ignores unescaped whitespace. Therefore, whitespace must be escaped:

> RegExp.escape(' \t')
'\\x20\\t'

We want escaped characters to be as short as possible. Alas, we can’t use any features that are enabled via the flags /u and /v. That leaves us with:

  • The following escapes take care of some whitespace and line terminator characters (we’ll see soon why the latter have to be escaped):

    \t \n \v \f \r
    
  • For Unicode code points up to 0xFF, we can use an (ASCII) hex escape – e.g.: \x41 matches A.

  • For Unicode code points up to 0xFFFF, we can use Unicode code unit escapes – e.g.: \u2028 matches the Unicode character LINE SEPARATOR.

  • For higher code points, we can’t use code point escapes such as \u{1F44D} because those are only supported with flag /u or /v. We have to use two code unit escapes. For now that’s not necessary but in the future, we may have to escape characters outside the Basic Multilingual Plane.
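The following sketch checks which of these escapes work without flags. It uses the thumbs-up emoji (U+1F44D), whose UTF-16 code units are 0xD83D and 0xDC4D:

```javascript
// Hex escapes and code unit escapes work without any flags
console.log(/^\x41$/.test('A')); // true
console.log(/^\u2028$/.test('\u2028')); // true

// A code point escape is only recognized with /u (or /v)
const noFlag = new RegExp('^\\u{1F44D}$');
const withU = new RegExp('^\\u{1F44D}$', 'u');
console.log(noFlag.test('👍')); // false
console.log(withU.test('👍')); // true

// Two code unit escapes (a surrogate pair) work with and without /u
const pairNoFlag = new RegExp('^\\uD83D\\uDC4D$');
const pairWithU = new RegExp('^\\uD83D\\uDC4D$', 'u');
console.log(pairNoFlag.test('👍')); // true
console.log(pairWithU.test('👍')); // true
```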

Escaping must work in all syntactic contexts  

In a regular expression, there can be many “contexts” (think nested scopes) – e.g.:

  • At the top level, syntax characters such as * and $ have to be escaped if we want to match them.
  • In a character class (such as [abc]):
    • Much top-level syntax does not have to be escaped – e.g. * and $.
    • Some other syntax does have to be escaped – e.g. - (hyphen).
    • With flag /v, several double punctuators have to be escaped – e.g. && and --. That is done by escaping both characters.
  • The class string disjunction \q{} is yet another context. Inside a character class, it adds one or more sequences of code points to the class.
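A few quick checks illustrate how the same character behaves differently at the top level and inside a character class (flag /v is left out here so that the snippet also runs on older engines):

```javascript
// Top level: * is a quantifier and must be escaped to match itself
console.log(/^a\*$/.test('a*')); // true

// Inside a character class, * and $ are ordinary characters
console.log(/^[*$]$/.test('*')); // true
console.log(/^[*$]$/.test('$')); // true

// Inside a character class, an unescaped hyphen between two characters
// defines a range
console.log(/^[a-c]$/.test('b')); // true
console.log(/^[a-c]$/.test('-')); // false

// An escaped hyphen is matched literally
console.log(/^[a\-c]$/.test('-')); // true
console.log(/^[a\-c]$/.test('b')); // false
```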

Consequences for escaping:

  • The upcoming flag /x supports line comments via # (i.e., a new context). Therefore, line terminators must be escaped (actual newline becomes escaped newline):

    > RegExp.escape('\n')
    '\\n'
    
  • The following characters are RegExp top-level syntax and can be escaped with a backslash:

    ^ $ \ . * + ? ( ) [ ] { } |
    

    Example:

    > RegExp.escape('$')
    '\\$'
    
  • Other punctuation characters are only syntax in some contexts – either now or, potentially, in the future. However, most of them can’t be escaped with a backslash if the flags include /u or /v. Therefore, escaping uses hex escapes in these cases (which are shorter than Unicode code unit escapes).

    , - = < > # & ! % : ; @ ~ ' ` "
    

    Example:

    > RegExp.escape('=>')
    '\\x3d\\x3e'
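The difference between backslash escapes and hex escapes can be checked with a small sketch. The helper compiles() is hypothetical and only exists for this illustration:

```javascript
// Hypothetical helper: reports whether a pattern compiles with the given flags
function compiles(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch {
    return false;
  }
}

// Without /u, a backslash before '=' is tolerated (legacy behavior)
console.log(compiles('\\=', '')); // true
// With /u, a backslash before '=' is a syntax error
console.log(compiles('\\=', 'u')); // false
// Syntax characters such as $ can be backslash-escaped with /u
console.log(compiles('\\$', 'u')); // true

// The hex escape works with and without /u
console.log(new RegExp('^\\x3d$').test('=')); // true
console.log(new RegExp('^\\x3d$', 'u').test('=')); // true
```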
    

Escaping must work whatever syntax precedes or succeeds the escaped text  

Some regular expressions are constructed like this:

new RegExp('<regex pattern>' + RegExp.escape(text))

We don’t want the result of RegExp.escape() to affect the regular expression pattern that comes before it:

  • \0 represents the NULL character (U+0000) and must not be followed by a decimal digit. That’s why initial decimal digits are escaped:

    > RegExp.escape('123')
    '\\x3123'
    
  • \1, \2, etc. are backreferences to numbered capture groups. An escaped text should not add decimal digits to them – which is taken care of by escaping initial decimal digits (see previous item).

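  • Control characters can be represented like this: \cA (Ctrl-A), ..., \cZ (Ctrl-Z). However, \c can also be used on its own – in which case it is interpreted verbatim (source). If a preceding pattern ended with \c, an initial ASCII letter in the escaped text would turn it into a control character escape. Therefore, an escaped text must not start with an unescaped ASCII letter: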

    > RegExp.escape('abc')
    '\\x61bc'
    
  • Flags /u and /v handle surrogate pairs as units. Therefore, we must escape lone surrogates so that they are not combined with preceding or succeeding lone surrogates in a regular expression pattern.
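Taken together, the rules from this and the previous sections can be sketched roughly as follows. This is a simplified illustration, not the specification's algorithm verbatim: the name escapeSketch() and its helpers are hypothetical, the slash is treated as a syntax character based on my reading of the proposal, and whitespace and line terminators are detected via \s.

```javascript
// Top-level syntax characters ('/' is included as an assumption, see above)
const SYNTAX_CHARACTERS = new Set('^$\\.*+?()[]{}|/');
// Characters that have short backslash escapes
const CONTROL_ESCAPES = new Map([
  ['\t', '\\t'], ['\n', '\\n'], ['\v', '\\v'], ['\f', '\\f'], ['\r', '\\r'],
]);
// Punctuation that is (or may become) syntax in some contexts
const OTHER_PUNCTUATORS = new Set(',-=<>#&!%:;@~\'`"');

// Lowercase hexadecimal, padded with zeros to the given length
function toHex(number, length) {
  return number.toString(16).padStart(length, '0');
}

// Hypothetical helper: simplified sketch of the escaping rules
function escapeSketch(text) {
  let result = '';
  // Iterates over code points; lone surrogates are yielded as single code units
  for (const char of text) {
    const codePoint = char.codePointAt(0);
    const isLoneSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
    if (result === '' && /^[0-9A-Za-z]$/.test(char)) {
      // An initial decimal digit or ASCII letter becomes a hex escape
      result += '\\x' + toHex(codePoint, 2);
    } else if (SYNTAX_CHARACTERS.has(char)) {
      result += '\\' + char;
    } else if (CONTROL_ESCAPES.has(char)) {
      result += CONTROL_ESCAPES.get(char);
    } else if (
      OTHER_PUNCTUATORS.has(char) || /^\s$/.test(char) || isLoneSurrogate
    ) {
      // Shortest escape that works without flags: \xHH, else \uHHHH
      result += codePoint <= 0xFF
        ? '\\x' + toHex(codePoint, 2)
        : '\\u' + toHex(codePoint, 4);
    } else {
      result += char;
    }
  }
  return result;
}

console.log(escapeSketch('(*)') === String.raw`\(\*\)`); // true
console.log(escapeSketch('123') === String.raw`\x3123`); // true
console.log(escapeSketch('=>') === String.raw`\x3d\x3e`); // true
```

The sketch reproduces the outputs shown in this post for the examples above. For production code, the polyfill or a native implementation is the better choice.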

Implementations of RegExp.escape()  

  • Support by various JavaScript platforms: see MDN
  • Polyfill on npm by Jordan Harband
  • I have written an implementation for purely educational purposes – i.e., its focus is readability, not practical usefulness.

Further reading  

  • If you want to read up on the various regular expression features mentioned in this blog post, you can check out chapter “Regular expressions (RegExp)” in “Exploring JavaScript”.
  • The Gist “Safe RegExp escape” by Kevin Gibbons explains which characters have to be escaped in various regular expression contexts.