ECMAScript proposal: RegExp flag `/v` makes character classes and character class escapes more powerful

[2022-11-15] dev, javascript, es proposal

In this blog post, we look at the ECMAScript proposal “RegExp v flag with set notation + properties of strings” by Markus Scherer and Mathias Bynens.

The new flag `/v`

The proposed new regular expression flag /v (.unicodeSets) enables three features:

Support for multi-code-point graphemes (such as some emojis) for character classes and Unicode property escapes (\p{}).
Character classes can be nested and combined via the set operations subtraction and intersection.
The flag also improves case-insensitive matching for negated character classes.

Given that the syntax had to be changed to enable nested character classes and set operations, a new flag was the best solution. /v can be viewed as an upgrade for flag /u: The two flags are mutually exclusive.

The presence of /v can be detected via the boolean getter .unicodeSets:

> /abc/.unicodeSets
false
> /abc/v.unicodeSets
true
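
Engines without the proposal reject the flag outright, so a feature check has to build the regular expression at runtime. The following is a minimal sketch; the helper names `supportsUnicodeSets` and `rejectsCombinedFlags` are hypothetical and not part of the proposal:

```javascript
// Hypothetical helper: returns true if this engine accepts flag /v.
// The RegExp constructor is used because a literal such as /abc/v
// would already fail at parse time on engines without /v.
function supportsUnicodeSets() {
  try {
    return new RegExp('abc', 'v').unicodeSets === true;
  } catch {
    return false;
  }
}

// /u and /v are mutually exclusive, so combining them should throw.
function rejectsCombinedFlags() {
  try {
    new RegExp('abc', 'uv');
    return false;
  } catch (err) {
    return err instanceof SyntaxError;
  }
}

console.log(supportsUnicodeSets(), rejectsCombinedFlags());
```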

Recap: code units vs. code points vs. graphemes

Unicode code points are the atomic parts of Unicode text. They have a range of 21 bits.
When Unicode is stored in a location such as a file or a programming language string, the atomic parts of that location are often smaller than 21 bits. Then one or more storage parts must be used to encode a code point. Unicode calls such storage parts code units.
- JavaScript string code units (which JavaScript calls characters) are 16 bits in size. One or two code units are used to encode a code point (UTF-16 format).
A grapheme is any symbol that can be used in Unicode text. A grapheme is composed of one or more code points. Graphemes are the real characters of Unicode.

The following code shows examples of these concepts:

// A code point that can be encoded as a single code unit.
// That’s why the length of the string is 1.
assert.equal(
  '⛔'.length, 1
);

// A code point that is encoded as two code units.
assert.equal(
  '🙂'.length, 2
);

// A grapheme that is composed of more than one code point.
// We need 5 code units to encode it.
assert.equal(
  '😵‍💫'.length, 5
);
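
To see why the last string has a length of 5, it can help to list its code points. This is a small sketch with a hypothetical helper `describe()`; the emoji is written with escapes so that the invisible zero width joiner (U+200D) shows up in the source code:

```javascript
// Hypothetical helper: describes a string at two levels.
function describe(str) {
  return {
    // .length counts UTF-16 code units
    codeUnits: str.length,
    // Iterating over a string yields one entry per code point
    codePoints: Array.from(str, (cp) =>
      'U+' + cp.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
    ),
  };
}

// Two emoji code points joined by a zero width joiner
const spiralEyes = '\u{1F635}\u200D\u{1F4AB}';
console.log(describe(spiralEyes));
// { codeUnits: 5, codePoints: [ 'U+1F635', 'U+200D', 'U+1F4AB' ] }
```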

Splitting strings

Code units, code points and graphemes also matter when it comes to splitting a string into parts.

The string method .split() splits a string into code units:

assert.deepEqual(
  '⛔🙂😵‍💫'.split(''),
  [
    '⛔',
    '\uD83D', '\uDE42',
    '\uD83D', '\uDE35', '\u200D', '\uD83D', '\uDCAB'
  ]
);

Iterating over a string (via Array.from(), spreading, destructuring, for-of, etc.) splits it into code points:

assert.deepEqual(
  Array.from('⛔🙂😵‍💫'),
  [ '⛔', '🙂', '😵', '\u200D', '💫' ]
);

Intl.Segmenter (which is not part of ECMAScript proper but of the ECMAScript Internationalization API) can split strings into graphemes:

const segmenter = new Intl.Segmenter(
  'en', {granularity: 'grapheme'}
);
assert.deepEqual(
  // Convert the iterable returned by .segment() into an Array
  Array.from(
    segmenter.segment('⛔🙂😵‍💫'),
    s => s.segment // map (unwrap each iterated value)
  ),
  [ '⛔', '🙂', '😵‍💫' ]
);
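
The three ways of splitting can be compared by counting the parts they produce. A minimal sketch, assuming `Intl.Segmenter` is available; the helper name `countGraphemes` is made up for this example:

```javascript
// Hypothetical helper: counts the graphemes in a string.
function countGraphemes(str, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

// Same text as above, written with escapes:
// U+26D4, U+1F642, and U+1F635 U+200D U+1F4AB
const text = '\u26D4\u{1F642}\u{1F635}\u200D\u{1F4AB}';

console.log(text.length);             // 8 (code units)
console.log(Array.from(text).length); // 5 (code points)
console.log(countGraphemes(text));    // 3 (graphemes)
```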

Terminology: character class escapes and character classes

A character set is a set of Unicode entities to be matched. Depending on regular expression flags, these entities are either code units, code points or code point sequences.

Character class escapes and character classes are syntax for defining character sets:

A character class escape defines a character set via a predefined name or a key-value pair. It is loosely similar to a variable name in JavaScript.
- Examples: \d \p{Decimal_Number} \p{Script=Greek}
A character class defines a character set by combining constructs that define character sets. It is delimited by square brackets and loosely similar to an expression in JavaScript.
- Examples: [αβγ] [^A-Z] [\d\s]

How character sets are influenced by RegExp flags

Depending on which flags a regular expression has, character class escapes and character classes define either:

Sets of code units
Sets of code points
Sets of code point sequences

Neither `/u` nor `/v`: character sets contain code units

Character classes

With neither /u nor /v, character classes match code units:

> /^[AΩ⛔]$/.test('A')
true
> /^[AΩ⛔]$/.test('Ω')
true
> /^[AΩ⛔]$/.test('⛔')
true

Character class escapes

Without /u and /v, the following character class escapes are supported:

\d is equivalent to [0-9]
- \D is equivalent to [^0-9]
\s matches all whitespace code points (which are all encoded as single code units)
- \S matches the complement of \s
\w is equivalent to [a-zA-Z0-9_]
- \W is equivalent to [^a-zA-Z0-9_]

Limitation: We can’t match code points

We can’t use code unit character classes to match a code point that is encoded as two code units because it produces two separate character set elements:

> /^[🙂]$/.test('🙂')
false

This test is equivalent to:

> /^[\uD83D\uDE42]$/.test('🙂')
false

Code unit character class escapes have the same downside:

> '🙂'.replaceAll(/\D/g, 'X')
'XX'
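
The flip side of this limitation is that such a character class accepts each half of the code point on its own. A short sketch (the variable names are only for illustration):

```javascript
// Without /u, this class contains two separate code units:
// the lead surrogate \uD83D and the trail surrogate \uDE42.
const codeUnitClass = /^[\uD83D\uDE42]$/;

const codeUnitResults = {
  leadSurrogate: codeUnitClass.test('\uD83D'),        // true
  trailSurrogate: codeUnitClass.test('\uDE42'),       // true
  wholeCodePoint: codeUnitClass.test('\uD83D\uDE42'), // false
};
console.log(codeUnitResults);
```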

`/u`: character classes as sets of code points

The regular expression flag /u (.unicode) was added in ECMAScript 6. With this flag, character sets contain code points and the previously mentioned limitations go away:

> /^[🙂]$/.test('🙂')
false
> /^[🙂]$/u.test('🙂')
true

> '🙂'.replaceAll(/\D/g, 'X')
'XX'
> '🙂'.replaceAll(/\D/gu, 'X')
'X'

Unicode property escapes

Flag /u also enables Unicode property escapes (which were added to JavaScript in ECMAScript 2018):

> /^\p{Emoji}$/u.test('⛔')
true
> /^\p{Emoji}$/u.test('🙂')
true

Limitation: We can’t match sequences of code points

Since character set elements are code points, we can’t match sequences of code points – for example:

> /^[😵‍💫]$/u.test('😵‍💫')
false
> /^\p{Emoji}$/u.test('😵‍💫')
false
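
With /u alone, one possible workaround is to move the sequences out of the character class and into an alternation. This is a sketch under the assumption that the sequences are known in advance; it is not something the proposal itself prescribes:

```javascript
// The sequence that does not fit into a /u character class
const zwjSequence = '\u{1F635}\u200D\u{1F4AB}';

// Sequences become alternatives, single code points stay in the class.
const emojiViaAlternation = new RegExp(
  `^(?:${zwjSequence}|[\u26D4\u{1F642}])$`, 'u'
);

console.log(emojiViaAlternation.test(zwjSequence)); // true
console.log(emojiViaAlternation.test('\u26D4'));    // true
console.log(emojiViaAlternation.test('\u{1F635}')); // false
```

The downside is that such patterns can’t be combined with other character class features such as ranges or negation.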

`/v`: extended character classes as sets of code point sequences (“strings”)

With the proposed flag /v, character sets contain code point sequences (“strings”).

String literals in character classes

Flag /v enables a new feature inside character classes – we can use \q{} to add code point sequences to their character sets:

> /^[\q{😵‍💫}]$/v.test('😵‍💫')
true

We can use a single \q{} to add multiple code point sequences – if we separate them with pipes:

> /^[\q{abc|def}]$/v.test('abc')
true
> /^[\q{abc|def}]$/v.test('def')
true

Unicode properties of strings

With /u, we can use Unicode property escapes (\p{} and \P{}) to specify sets of code points via Unicode properties.

With /v, we can also use them to specify sets of code point sequences via Unicode properties of strings:

> /^\p{RGI_Emoji}$/v.test('😵‍💫')
true
> '😵‍💫'.replaceAll(/^\p{RGI_Emoji}$/gv, 'X')
'X'

For now, the following Unicode properties of strings are supported:

Basic_Emoji: single code points
Emoji_Keycap_Sequence: e.g. 1️⃣
RGI_Emoji_Modifier_Sequence: e.g. ☝🏿
RGI_Emoji_Flag_Sequence: e.g. 🇰🇪
RGI_Emoji_Tag_Sequence: e.g. 🏴󠁧󠁢󠁳󠁣󠁴󠁿
RGI_Emoji_ZWJ_Sequence: e.g. 🧑‍🌾
RGI_Emoji: union of all of the above sets

These properties are defined in text files published by the Unicode Consortium.

It’s interesting that the definitions are simply enumerations of code point sequences – for example:

0023 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: \x{23}
002A FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: *
0030 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 0
0031 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 1
0032 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 2
0033 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 3
0034 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 4
0035 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 5
0036 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 6
0037 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 7
0038 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 8
0039 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 9
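
Because the format is so simple, a line like the ones above can be converted to a JavaScript string with a few lines of code. This is a sketch only: the function name `parseSequenceLine` is hypothetical, and it assumes one space-separated sequence per line (comments, blank lines and any other notation the real files may use are not handled):

```javascript
// Hypothetical parser for lines of the form
//   <hex code points> ; <property name> ; <description>
function parseSequenceLine(line) {
  const [codePointField, propertyField] = line.split(';');
  const codePoints = codePointField
    .trim()
    .split(/\s+/)
    .map((hex) => Number.parseInt(hex, 16));
  return {
    property: propertyField.trim(),
    string: String.fromCodePoint(...codePoints),
  };
}

const keycapEntry = parseSequenceLine(
  '0031 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 1'
);
console.log(keycapEntry.property);                  // 'Emoji_Keycap_Sequence'
console.log(keycapEntry.string === '1\uFE0F\u20E3'); // true
```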

There are plans to support Unicode properties of strings with the /u flag, so this feature may not remain exclusive to /v.

Limitation: negated character classes and character class escapes

Negated character classes still only match code points:

> '😵‍💫'.replaceAll(/[^0-9]/gv, 'X')
'XXX'

Negating Unicode properties of strings is a syntax error.

This issue could be fixed if regular expressions treated input strings as sequences of graphemes. However, that would be a significant undertaking and is beyond the scope of the proposal.

Set operations for character classes

To enable set operations for character classes, we must be able to nest them. With character class escapes, there already is nesting. Flag /v also lets us nest character classes. The following two regular expressions are equivalent:

> /^[\d\w]$/v.test('7')
true
> /^[\d\w]$/v.test('H')
true
> /^[\d\w]$/v.test('?')
false

> /^[[0-9][A-Za-z0-9_]]$/v.test('7')
true
> /^[[0-9][A-Za-z0-9_]]$/v.test('H')
true
> /^[[0-9][A-Za-z0-9_]]$/v.test('?')
false

Subtraction of character sets via `--`

We can use the -- operator to set-theoretically subtract the character sets defined by character classes or character class escapes:

> /^[\w--[a-g]]$/v.test('a')
false
> /^[\w--[a-g]]$/v.test('h')
true

> /^[\p{Number}--[0-9]]$/v.test('٣')
true
> /^[\p{Number}--[0-9]]$/v.test('3')
false

> /^[\p{RGI_Emoji_Flag_Sequence}--\q{🇩🇪}]$/v.test('🇳🇿')
true
> /^[\p{RGI_Emoji_Flag_Sequence}--\q{🇩🇪}]$/v.test('🇩🇪')
false

Single code points can also be used on either side of the -- operator:

> /^[\w--a]$/v.test('a')
false
> /^[\w--a]$/v.test('b')
true

Intersection of character sets via `&&`

We can use the && operator to set-theoretically intersect the character sets defined by character classes or character class escapes:

> /[\p{ASCII}&&\p{Decimal_Number}]/v.test('4')
true
> /[\p{ASCII}&&\p{Decimal_Number}]/v.test('X')
false

> /^[\p{Script=Arabic}&&\p{Number}]$/v.test('٣')
true
> /^[\p{Script=Arabic}&&\p{Number}]$/v.test('ق')
false

Union of character sets

To compute the set-theoretical union of character sets, we only need to write their defining constructs next to each other inside a character class:

> /^[\p{Emoji_Keycap_Sequence}[a-z]]+$/v.test('a3️⃣c')
true

Improved case-insensitive matching

Flag /u has a quirk when it comes to case-insensitive matching.

If a character class escape is negated, the complement of its character set is computed first and then /i is handled (via Unicode Simple_Case_Folding (SCF)) – for example:

> /^\P{Lowercase_Letter}$/iu.test('A')
true
> /^\P{Lowercase_Letter}$/iu.test('a')
true

The character set of \P{Lowercase_Letter} includes (among other code points) uppercase letters such as “A”. During SCF, their lowercase versions are added, which explains the results in the previous example.

If a character class is negated, SCF is applied to the character class before its complement is computed – for example:

> /^[^\p{Lowercase_Letter}]$/iu.test('A')
false
> /^[^\p{Lowercase_Letter}]$/iu.test('a')
false

The character set of \p{Lowercase_Letter} contains all lowercase letters. After SCF, it also contains all uppercase letters. The complement of that set matches neither lowercase nor uppercase letters, which explains the results in the previous example.

For comparison, this is what happens without /i:

> /^\P{Lowercase_Letter}$/u.test('A')
true
> /^\P{Lowercase_Letter}$/u.test('a')
false

Two observations:

Both ways of negating should produce the same results.
Intuitively, if we add /i to a regular expression, it should match at least as many strings as before (not fewer).

That’s why with flag /v, case folding (“deep case closure”) is performed after all character sets were computed (loosely similarly to the first example in this section):

> /^\P{Lowercase_Letter}$/iv.test('A')
true
> /^\P{Lowercase_Letter}$/iv.test('a')
true

> /^[^\p{Lowercase_Letter}]$/iv.test('A')
true
> /^[^\p{Lowercase_Letter}]$/iv.test('a')
true

Source of this section: GitHub issue “IgnoreCase vs. complement vs. nested class”

Which characters must be escaped inside `/v` character classes?

Inside /u character classes, we must escape:

\ ]

Some characters only have to be escaped in some locations:

`-` only has to be escaped if it doesn’t come first or last.
`^` only has to be escaped if it comes first.

Inside /v character classes, we additionally must always escape:

Special characters:
```
( ) [ { } / - |
```

Double punctuators:

&& !! ## $$ %% ** ++ ,, .. :: ;; << == >> ?? @@ ^^ `` ~~

Consequences:

When escaping plain text for regular expressions, more characters must be escaped:
```
/ - & ! # % , : ; < = > @ ` ~
```
Alas, escaping these characters is currently illegal with flag /u. There are plans to change that, though.

Implementations

The Babel plugin @babel/plugin-proposal-unicode-sets-regex transpiles regular expressions with flag /v:
- It was used to test the code in this blog post.
- It does not support the getter RegExp.prototype.unicodeSets.

Resources

Sources of this blog post (in addition to the proposal itself):

Specification for the proposal
Article “RegExp v flag with set notation and properties of strings” by Mark Davis, Markus Scherer, and Mathias Bynens

More information on some of the topics covered in this blog post:

Chapter “Unicode – a brief introduction” in “Exploring JavaScript”
Section “Flag: Unicode mode via /u” in “Exploring JavaScript”
Section “Unicode property escapes” in “Exploring JavaScript”

Useful resources:

The Wikipedia page “Unicode character property” contains a list of Unicode character properties
The Emojipedia page “Every emoji by codepoint” lists the codepoints that make up emojis.
The Compart page “Unicode Block ‘Miscellaneous Symbols’” lists, among others, emojis that are in the Basic Multilingual Plane (16 bits).
