ECMAScript proposal: RegExp flag /v makes character classes and character class escapes more powerful

[2022-11-15] dev, javascript, es proposal

In this blog post, we look at the ECMAScript proposal “RegExp v flag with set notation + properties of strings” by Markus Scherer and Mathias Bynens.

The new flag /v  

The proposed new regular expression flag /v (.unicodeSets) enables three features:

  • Support for multi-code-point graphemes (such as some emojis) for character classes and Unicode property escapes (\p{}).

  • Character classes can be nested and combined via the set operations subtraction and intersection.

  • The flag also improves case-insensitive matching for negated character classes.

Given that the syntax had to be changed to enable nested character classes and set operations, a new flag was the best solution. /v can be viewed as an upgrade for flag /u: it builds on what /u does, and the two flags are mutually exclusive (using both in the same regular expression is a syntax error).

The presence of /v can be detected via the boolean getter .unicodeSets:

> /abc/.unicodeSets
false
> /abc/v.unicodeSets
true
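A regular expression literal with an unknown flag is a syntax error, so support for /v is easiest to probe at runtime via the RegExp constructor. The following is a minimal sketch; the helper names supportsFlagV and canCombineUAndV are made up for this example.

```javascript
// Hypothetical helper: does this engine accept the flag /v?
function supportsFlagV() {
  try {
    new RegExp('', 'v');
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: /u and /v are mutually exclusive,
// so combining them should throw a SyntaxError.
function canCombineUAndV() {
  try {
    new RegExp('', 'uv');
    return true;
  } catch {
    return false;
  }
}

console.log(supportsFlagV());
console.log(canCombineUAndV()); // false
```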

Recap: code units vs. code points vs. graphemes  

  • Unicode code points are the atomic parts of Unicode text. They have a range of 21 bits.
  • When Unicode is stored in a location such as a file or a programming language string, the atomic parts of that location are often smaller than 21 bits. Then one or more storage parts must be used to encode a code point. Unicode calls such storage parts code units.
    • JavaScript string code units (which JavaScript calls characters) are 16 bits in size. One or two code units are used to encode a code point (UTF-16 format).
  • A grapheme is any symbol that can be used in Unicode text. A grapheme is composed of one or more code points. Graphemes are the real characters of Unicode.

The following code shows examples of these concepts:

// A code point that can be encoded as a single code unit.
// That’s why the length of the string is 1.
assert.equal('⛔'.length, 1);

// A code point that is encoded as two code units.
assert.equal('🙂'.length, 2);

// A grapheme that is composed of more than one code point.
// We need 5 code units to encode it.
assert.equal('😵‍💫'.length, 5);
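To see which code points hide behind a grapheme, we can list them in U+XXXX notation. This is only a sketch; codePointsOf is a name chosen for this example, and the emoji is written with escapes so that its structure is visible.

```javascript
// Hypothetical helper: lists the code points of a string in U+XXXX notation.
function codePointsOf(str) {
  // Array.from() iterates over code points, not over code units
  return Array.from(
    str,
    cp => 'U+' + cp.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// 😵‍💫 = face with crossed-out eyes + zero width joiner + dizzy symbol
const spiralEyes = '\u{1F635}\u200D\u{1F4AB}';

console.log(spiralEyes.length); // 5 (code units)
console.log(codePointsOf(spiralEyes)); // [ 'U+1F635', 'U+200D', 'U+1F4AB' ]
```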

Splitting strings  

Code units, code points and graphemes also matter when it comes to splitting a string into parts.

The string method .split() splits a string into code units if we use the empty string as a separator:

assert.deepEqual('⛔🙂😵‍💫'.split(''), [
  '⛔', '\uD83D', '\uDE42',
  '\uD83D', '\uDE35', '\u200D', '\uD83D', '\uDCAB'
]);

Iterating over a string (via Array.from(), spreading, destructuring, for-of, etc.) splits it into code points:

assert.deepEqual(Array.from('⛔🙂😵‍💫'),
  [ '⛔', '🙂', '😵', '\u200D', '💫' ]);

Intl.Segmenter (which is not part of ECMAScript proper but of the ECMAScript Internationalization API) can split strings into graphemes:

const segmenter = new Intl.Segmenter(
  'en', {granularity: 'grapheme'}
);
assert.deepEqual(
  // Convert the iterable returned by .segment() to an Array
  Array.from(segmenter.segment('⛔🙂😵‍💫'),
    s => s.segment // map (unwrap each iterated value)
  ),
  [ '⛔', '🙂', '😵‍💫' ]
);
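The three ways of splitting lead to three different notions of “length”. The following sketch compares them; countGraphemes is a hypothetical helper, and it assumes that Intl.Segmenter is available in the engine.

```javascript
// Hypothetical helper: counts the graphemes of a string via Intl.Segmenter.
function countGraphemes(str, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, {granularity: 'grapheme'});
  let count = 0;
  for (const _segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

// ⛔🙂😵‍💫 written with escapes
const text = '\u26D4\u{1F642}\u{1F635}\u200D\u{1F4AB}';

console.log(text.length); // 8 code units
console.log(Array.from(text).length); // 5 code points
console.log(countGraphemes(text)); // 3 graphemes
```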

Terminology: character class escapes and character classes  

A character set is a set of Unicode entities to be matched. Depending on regular expression flags, these entities are either code units, code points or code point sequences.

Character class escapes and character classes are syntax for defining character sets:

  • A character class escape defines a character set via a predefined name or a key-value pair. It is loosely similar to a variable name in JavaScript.

    • Examples: \d \p{Decimal_Number} \p{Script=Greek}
  • A character class defines a character set by combining constructs that define character sets. It is delimited by square brackets and loosely similar to an expression in JavaScript.

    • Examples: [αβγ] [^A-Z] [\d\s]

How character sets are influenced by RegExp flags  

Depending on which flags a regular expression has, character class escapes and character classes define either:

  • Sets of code units
  • Sets of code points
  • Sets of code point sequences (“strings”)

Neither /u nor /v: character sets contain code units  

Character classes  

With neither /u nor /v, character classes match code units:

> /^[AΩ⛔]$/.test('A')
true
> /^[AΩ⛔]$/.test('Ω')
true
> /^[AΩ⛔]$/.test('⛔')
true

Character class escapes  

With neither /u nor /v, the following character class escapes are supported:

  • \d is equivalent to [0-9]
    • \D is equivalent to [^0-9]
  • \s matches all whitespace code points (which are all encoded as single code units)
    • \S matches the complement of \s
  • \w is equivalent to [a-zA-Z0-9_]
    • \W is equivalent to [^a-zA-Z0-9_]
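A consequence of these definitions is that \d and \w only cover ASCII, while \s also covers non-ASCII whitespace. A small sketch (the property names of checks are only labels for this example):

```javascript
const checks = {
  asciiDigit: /^\d$/.test('7'),
  arabicIndicDigit: /^\d$/.test('\u0663'), // ٣ is a digit, but not in [0-9]
  asciiLetter: /^\w$/.test('e'),
  accentedLetter: /^\w$/.test('\u00E9'), // é is not in [a-zA-Z0-9_]
  noBreakSpace: /^\s$/.test('\u00A0'), // whitespace beyond ASCII
};

console.log(checks);
```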

Limitation: We can’t match code points  

We can’t use code unit character classes to match a code point that is encoded as two code units because it produces two separate character set elements:

> /^[🙂]$/.test('🙂')
false

This test is equivalent to:

> /^[\uD83D\uDE42]$/.test('🙂')
false

Code unit character class escapes have the same downside:

> '🙂'.replaceAll(/\D/g, 'X')
'XX'
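The following sketch makes the two separate character set elements visible: each surrogate code unit matches on its own, while the pair does not. Without /u, a code point beyond the 16-bit range can only be matched as a sequence of code units outside a character class.

```javascript
// Without /u, this class contains the two surrogate code units of 🙂
const codeUnitClass = /^[\uD83D\uDE42]$/;
console.log(codeUnitClass.test('\uD83D')); // true (lone leading surrogate)
console.log(codeUnitClass.test('\uDE42')); // true (lone trailing surrogate)
console.log(codeUnitClass.test('\uD83D\uDE42')); // false (two code units)

// Workaround without /u: match the code unit sequence outside a class
const codeUnitSequence = /^\uD83D\uDE42$/;
console.log(codeUnitSequence.test('\uD83D\uDE42')); // true
```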

/u: character classes as sets of code points  

The regular expression flag /u (.unicode) was added in ECMAScript 6. With this flag, character sets contain code points and the previously mentioned limitations go away:

> /^[🙂]$/.test('🙂')
false
> /^[🙂]$/u.test('🙂')
true

> '🙂'.replaceAll(/\D/g, 'X')
'XX'
> '🙂'.replaceAll(/\D/gu, 'X')
'X'

Unicode property escapes  

Flag /u also enables Unicode property escapes (which were added to JavaScript in ECMAScript 2018):

> /^\p{Emoji}$/u.test('⛔')
true
> /^\p{Emoji}$/u.test('🙂')
true
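One thing to keep in mind when using \p{Emoji}: the Unicode property Emoji also applies to code points that we may not think of as emojis, such as the ASCII digits (they can start keycap sequences). The following sketch contrasts it with the stricter property Emoji_Presentation; the constant names are only for illustration.

```javascript
const isEmoji = /^\p{Emoji}$/u;
const hasEmojiPresentation = /^\p{Emoji_Presentation}$/u;

console.log(isEmoji.test('7')); // true
console.log(hasEmojiPresentation.test('7')); // false
console.log(hasEmojiPresentation.test('\u{1F642}')); // true (🙂)
```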

Limitation: We can’t match sequences of code points  

Since character set elements are code points, we can’t match sequences of code points – for example:

> /^[😵‍💫]$/u.test('😵‍💫')
false
> /^\p{Emoji}$/u.test('😵‍💫')
false
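To illustrate why the character class fails: with /u, it is simply a set of three code points, each of which matches on its own. The sketch below also shows one possible workaround with /u, an alternation outside the character class. The emoji is written with escapes so that its parts are visible.

```javascript
const spiralEyes = '\u{1F635}\u200D\u{1F4AB}'; // 😵‍💫

// With /u, this class is a set of three code points: 😵, ZWJ, 💫
const codePointClass = /^[\u{1F635}\u200D\u{1F4AB}]$/u;
console.log(codePointClass.test('\u{1F635}')); // true
console.log(codePointClass.test('\u200D')); // true
console.log(codePointClass.test(spiralEyes)); // false

// Possible workaround with /u: list sequences as alternatives
const emojiOrSequence = /^(?:\u{1F635}\u200D\u{1F4AB}|\p{Emoji})$/u;
console.log(emojiOrSequence.test(spiralEyes)); // true
console.log(emojiOrSequence.test('\u{1F642}')); // true
```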

/v: extended character classes as sets of code point sequences (“strings”)  

With the proposed flag /v, character sets contain code point sequences (“strings”).

String literals in character classes  

Flag /v enables a new feature inside character classes – we can use \q{} to add code point sequences to their character sets:

> /^[\q{😵‍💫}]$/v.test('😵‍💫')
true

We can use a single \q{} to add multiple code point sequences – if we separate them with pipes:

> /^[\q{abc|def}]$/v.test('abc')
true
> /^[\q{abc|def}]$/v.test('def')
true

Unicode properties of strings  

With /u, we can use Unicode property escapes (\p{} and \P{}) to specify sets of code points via Unicode properties.

With /v, we can also use them to specify sets of code point sequences via Unicode properties of strings:

> /^\p{RGI_Emoji}$/v.test('😵‍💫')
true
> '😵‍💫'.replaceAll(/^\p{RGI_Emoji}$/gv, 'X')
'X'

For now, the following Unicode properties of strings are supported:

  • Basic_Emoji: single code points
  • Emoji_Keycap_Sequence: e.g. 1️⃣
  • RGI_Emoji_Modifier_Sequence: e.g. ☝🏿
  • RGI_Emoji_Flag_Sequence: e.g. 🇰🇪
  • RGI_Emoji_Tag_Sequence: e.g. 🏴󠁧󠁢󠁳󠁣󠁴󠁿
  • RGI_Emoji_ZWJ_Sequence: e.g. 🧑‍🌾
  • RGI_Emoji: union of all of the above sets

These properties are defined in plain-text data files published by the Unicode Consortium.

It’s interesting that the definitions are simply enumerations of code point sequences – for example:

0023 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: \x{23}
002A FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: *
0030 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 0
0031 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 1
0032 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 2
0033 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 3
0034 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 4
0035 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 5
0036 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 6
0037 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 7
0038 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 8
0039 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 9
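Because each line is just a list of hexadecimal code points plus metadata, it is easy to turn it into a JavaScript string. The following sketch assumes exactly the three semicolon-separated fields shown in the excerpt above (the real files may contain more, such as trailing comments); parseSequenceLine is a name made up for this example.

```javascript
// Hypothetical helper: parses one line in the format shown above.
function parseSequenceLine(line) {
  const [codePoints, property, description] =
    line.split(';').map(field => field.trim());
  // '0031 FE0F 20E3' -> [0x31, 0xFE0F, 0x20E3] -> string
  const sequence = String.fromCodePoint(
    ...codePoints.split(/\s+/).map(hex => Number.parseInt(hex, 16))
  );
  return {sequence, property, description};
}

const parsed = parseSequenceLine(
  '0031 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 1'
);
console.log(parsed.property); // 'Emoji_Keycap_Sequence'
console.log(parsed.sequence === '1\uFE0F\u20E3'); // true
```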

There are plans to support Unicode properties of strings with the /u flag, so this feature may not remain exclusive to /v.

Limitation: negated character classes and character class escapes  

Negated character classes still only match code points:

> '😵‍💫'.replaceAll(/[^0-9]/gv, 'X')
'XXX'

Negating Unicode properties of strings is a syntax error.

This issue could be fixed if regular expressions treated input strings as sequences of graphemes. However, that would be a significant undertaking and is beyond the scope of the proposal.

Set operations for character classes  

To enable set operations for character classes, we must be able to nest them. With character class escapes, there already is nesting. Flag /v also lets us nest character classes. The following two regular expressions are equivalent:

> /^[\d\w]$/v.test('7')
true
> /^[\d\w]$/v.test('H')
true
> /^[\d\w]$/v.test('?')
false

> /^[[0-9][A-Za-z0-9_]]$/v.test('7')
true
> /^[[0-9][A-Za-z0-9_]]$/v.test('H')
true
> /^[[0-9][A-Za-z0-9_]]$/v.test('?')
false

Subtraction of character sets via --  

We can use the -- operator to set-theoretically subtract the character sets defined by character classes or character class escapes:

> /^[\w--[a-g]]$/v.test('a')
false
> /^[\w--[a-g]]$/v.test('h')
true

> /^[\p{Number}--[0-9]]$/v.test('٣')
true
> /^[\p{Number}--[0-9]]$/v.test('3')
false

> /^[\p{RGI_Emoji_Flag_Sequence}--\q{🇩🇪}]$/v.test('🇳🇿')
true
> /^[\p{RGI_Emoji_Flag_Sequence}--\q{🇩🇪}]$/v.test('🇩🇪')
false

Single code points can also be used on either side of the -- operator:

> /^[\w--a]$/v.test('a')
false
> /^[\w--a]$/v.test('b')
true

Intersection of character sets via &&  

We can use the && operator to set-theoretically intersect the character sets defined by character classes or character class escapes:

> /[\p{ASCII}&&\p{Decimal_Number}]/v.test('4')
true
> /[\p{ASCII}&&\p{Decimal_Number}]/v.test('X')
false

> /^[\p{Script=Arabic}&&\p{Number}]$/v.test('٣')
true
> /^[\p{Script=Arabic}&&\p{Number}]$/v.test('ق')
false

Union of character sets  

To compute the set-theoretical union of character sets, we only need to write their defining constructs next to each other inside a character class:

> /^[\p{Emoji_Keycap_Sequence}[a-z]]+$/v.test('a3️⃣c')
true

Improved case-insensitive matching  

Flag /u has a quirk when it comes to case-insensitive matching.

If a character class escape is negated, the complement of its character set is computed first and then /i is handled (via Unicode Simple_Case_Folding (SCF)) – for example:

> /^\P{Lowercase_Letter}$/iu.test('A')
true
> /^\P{Lowercase_Letter}$/iu.test('a')
true

The character set of \P{Lowercase_Letter} includes (among other code points) uppercase letters such as “A”. During SCF, their lowercase versions are added, which explains the results in the previous example.

If a character class is negated, SCF is applied to the character class before its complement is computed – for example:

> /^[^\p{Lowercase_Letter}]$/iu.test('A')
false
> /^[^\p{Lowercase_Letter}]$/iu.test('a')
false

The character set of \p{Lowercase_Letter} contains all lowercase letters. After SCF, it also contains all uppercase letters. The complement of that set matches neither lowercase nor uppercase letters, which explains the results in the previous example.

For comparison, this is what happens without /i:

> /^\P{Lowercase_Letter}$/u.test('A')
true
> /^\P{Lowercase_Letter}$/u.test('a')
false
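The quirk of /u is easy to reproduce by putting both ways of negating side by side. In this sketch, compareNegations is a hypothetical helper; it only uses /iu, so it does not depend on support for /v.

```javascript
// Hypothetical helper: compares the two ways of negating under /iu.
function compareNegations(char) {
  return {
    negatedEscape: /^\P{Lowercase_Letter}$/iu.test(char),
    negatedClass: /^[^\p{Lowercase_Letter}]$/iu.test(char),
  };
}

console.log(compareNegations('A')); // { negatedEscape: true, negatedClass: false }
console.log(compareNegations('a')); // { negatedEscape: true, negatedClass: false }
// For characters without case, both ways agree:
console.log(compareNegations('7')); // { negatedEscape: true, negatedClass: true }
```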

Two observations:

  • Both ways of negating should produce the same results.
  • Intuitively, if we add /i to a regular expression, it should match at least as many strings as before (not fewer).

That’s why with flag /v, case folding (“deep case closure”) is performed after all character sets were computed (loosely similarly to the first example in this section):

> /^\P{Lowercase_Letter}$/iv.test('A')
> /^\P{Lowercase_Letter}$/iv.test('a')

> /^[^\p{Lowercase_Letter}]$/iv.test('A')
> /^[^\p{Lowercase_Letter}]$/iv.test('a')

Source of this section: GitHub issue “IgnoreCase vs. complement vs. nested class”

Which characters must be escaped inside /v character classes?  

Inside /u character classes, we must escape:

\ ]

Some characters only have to be escaped in some locations:

  • - only has to be escaped if it doesn’t come first or last.
  • ^ only has to be escaped if it comes first.

Inside /v character classes, we additionally must always escape:

  • Special characters:

    ( ) [ { } / - |
  • Double punctuators:

    && !! ## $$ %% ** ++ ,, .. :: ;; << == >> ?? @@ ^^ `` ~~



