/v makes character classes and character class escapes more powerful
In this blog post, we look at the ECMAScript proposal “RegExp v flag with set notation + properties of strings” by Markus Scherer and Mathias Bynens.
/v
The proposed new regular expression flag /v (.unicodeSets) enables three features:
Support for multi-code-point graphemes (such as some emojis) for character classes and Unicode property escapes (\p{}).
Character classes can be nested and combined via the set operations subtraction and intersection.
The flag also improves case-insensitive matching for negated character classes.
Given that the syntax had to be changed to enable nested character classes and set operations, a new flag was the best solution. /v can be viewed as an upgrade for flag /u: The two flags are mutually exclusive.
The presence of /v can be detected via the boolean getter .unicodeSets:
> /abc/.unicodeSets
false
> /abc/v.unicodeSets
true
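Engines that predate the proposal reject the flag with a SyntaxError, so feature detection has to go through the RegExp constructor rather than a literal. The following is a minimal sketch; the helper names supportsFlagV and acceptsFlags are made up for illustration and are not part of the proposal:

```javascript
// Hypothetical helper: detects whether the current engine accepts flag /v.
function supportsFlagV() {
  try {
    // A regular expression literal with /v would already fail at parse
    // time in older engines, so we use the constructor instead.
    new RegExp('', 'v');
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: checks whether a combination of flags is accepted.
function acceptsFlags(flags) {
  try {
    new RegExp('a', flags);
    return true;
  } catch {
    return false;
  }
}

console.log(supportsFlagV());
// /u and /v are mutually exclusive, so combining them is always rejected:
console.log(acceptsFlags('uv')); // false
```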
The following code shows examples of code units, code points and graphemes:
// A code point that can be encoded as a single code unit.
// That’s why the length of the string is 1.
assert.equal(
'⛔'.length, 1
);
// A code point that is encoded as two code units.
assert.equal(
'🙂'.length, 2
);
// A grapheme that is composed of more than one code point.
// We need 5 code units to encode it.
assert.equal(
'😵💫'.length, 5
);
Code units, code points and graphemes also matter when it comes to splitting a string into parts.
The string method .split() with an empty string as separator splits a string into code units:
assert.deepEqual(
'⛔🙂😵💫'.split(''),
[
'⛔',
'\uD83D', '\uDE42',
'\uD83D', '\uDE35', '\u200D', '\uD83D', '\uDCAB'
]
);
Iterating over a string (via Array.from(), spreading, destructuring, for-of, etc.) splits it into code points:
assert.deepEqual(
Array.from('⛔🙂😵💫'),
[ '⛔', '🙂', '😵', '\u200D', '💫' ]
);
Intl.Segmenter (which is not part of ECMAScript proper but of the ECMAScript Internationalization API) can split strings into graphemes:
const segmenter = new Intl.Segmenter(
'en', {granularity: 'grapheme'}
);
assert.deepEqual(
// Convert the iterable returned by .segment() into an Array
Array.from(
segmenter.segment('⛔🙂😵💫'),
s => s.segment // map (unwrap each iterated value)
),
[ '⛔', '🙂', '😵💫' ]
);
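The three ways of splitting can be combined into one small helper that reports how long a string is at each level. This is only a sketch; the function name countUnits is an assumption made for this example:

```javascript
// Hypothetical helper: measures a string in code units, code points
// and graphemes.
function countUnits(str) {
  const segmenter = new Intl.Segmenter('en', {granularity: 'grapheme'});
  return {
    codeUnits: str.length,
    codePoints: Array.from(str).length,
    graphemes: Array.from(segmenter.segment(str)).length,
  };
}

// Escapes make the zero-width joiner (U+200D) between the two emojis visible.
const faceWithSpiralEyes = '\u{1F635}\u200D\u{1F4AB}';
console.log(countUnits(faceWithSpiralEyes));
// { codeUnits: 5, codePoints: 3, graphemes: 1 }
```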
A character set is a set of Unicode entities to be matched. Depending on regular expression flags, these entities are either code units, code points or code point sequences.
Character class escapes and character classes are syntax for defining character sets:
A character class escape defines a character set via a predefined name or a key-value pair. It is loosely similar to a variable name in JavaScript.
\d \p{Decimal_Number} \p{Script=Greek}
A character class defines a character set by combining constructs that define character sets. It is delimited by square brackets and loosely similar to an expression in JavaScript.
[αβγ] [^A-Z] [\d\s]
Depending on which flags a regular expression has, character class escapes and character classes define sets of code units, sets of code points or sets of code point sequences.
Neither /u nor /v: character sets contain code units
With neither /u nor /v, character classes match code units:
> /^[AΩ⛔]$/.test('A')
true
> /^[AΩ⛔]$/.test('Ω')
true
> /^[AΩ⛔]$/.test('⛔')
true
Without /u and /v, the following character class escapes are supported:
\d is equivalent to [0-9]
\D is equivalent to [^0-9]
\s matches all whitespace code points (which are all encoded as single code units)
\S matches the complement of \s
\w is equivalent to [a-zA-Z0-9_]
\W is equivalent to [^a-zA-Z0-9_]
We can’t use code unit character classes to match a code point that is encoded as two code units because it produces two separate character set elements:
> /^[🙂]$/.test('🙂')
false
This test is equivalent to:
> /^[\uD83D\uDE42]$/.test('🙂')
false
Code unit character class escapes have the same downside:
> '🙂'.replaceAll(/\D/g, 'X')
'XX'
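One consequence of code unit sets is that such a character class matches each half of a surrogate pair on its own. The following sketch illustrates that (the variable names are ours):

```javascript
const smiley = '\u{1F642}'; // 🙂, encoded as the code units \uD83D \uDE42

// Without /u or /v, [🙂] is the same as [\uD83D\uDE42]:
// a set with two code units.
const codeUnitClass = /^[\uD83D\uDE42]$/;

console.log(codeUnitClass.test(smiley));    // false (two code units)
console.log(codeUnitClass.test(smiley[0])); // true (lone lead surrogate)
console.log(codeUnitClass.test(smiley[1])); // true (lone trail surrogate)
```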
/u: character classes as sets of code points
The regular expression flag /u (.unicode) was added in ECMAScript 6. With this flag, character sets contain code points and the previously mentioned limitations go away:
> /^[🙂]$/.test('🙂')
false
> /^[🙂]$/u.test('🙂')
true
> '🙂'.replaceAll(/\D/g, 'X')
'XX'
> '🙂'.replaceAll(/\D/gu, 'X')
'X'
Flag /u also enables Unicode property escapes (which were added to JavaScript in ECMAScript 2018):
> /^\p{Emoji}$/u.test('⛔')
true
> /^\p{Emoji}$/u.test('🙂')
true
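Property escapes also work with key-value pairs and in their negated form \P{}. A short sketch, reusing the properties Script=Greek and Decimal_Number that were mentioned earlier (the variable names are ours):

```javascript
// Key-value form: all code points whose script is Greek.
const greekWord = /^\p{Script=Greek}+$/u;
console.log(greekWord.test('αβγ')); // true
console.log(greekWord.test('abc')); // false

// \P{} is the negated form: any code point that is not a decimal number.
const noDigits = /^\P{Decimal_Number}+$/u;
console.log(noDigits.test('αβγ\u{1F642}')); // true
console.log(noDigits.test('α3'));           // false
```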
Since character set elements are code points, we can’t match sequences of code points – for example:
> /^[😵💫]$/u.test('😵💫')
false
> /^\p{Emoji}$/u.test('😵💫')
false
/v: extended character classes as sets of code point sequences (“strings”)
With the proposed flag /v, character sets contain code point sequences (“strings”).
Flag /v enables a new feature inside character classes – we can use \q{} to add code point sequences to their character sets:
> /^[\q{😵💫}]$/v.test('😵💫')
true
We can use a single \q{} to add multiple code point sequences – if we separate them with pipes:
> /^[\q{abc|def}]$/v.test('abc')
true
> /^[\q{abc|def}]$/v.test('def')
true
With /u, we can use Unicode property escapes (\p{} and \P{}) to specify sets of code points via Unicode properties.
With /v, we can also use them to specify sets of code point sequences via Unicode properties of strings:
> /^\p{RGI_Emoji}$/v.test('😵💫')
true
> '😵💫'.replaceAll(/^\p{RGI_Emoji}$/gv, 'X')
'X'
For now, the following Unicode properties of strings are supported:
Basic_Emoji: single code points
Emoji_Keycap_Sequence: e.g. 1️⃣
RGI_Emoji_Modifier_Sequence: e.g. ☝🏿
RGI_Emoji_Flag_Sequence: e.g. 🇰🇪
RGI_Emoji_Tag_Sequence: e.g. 🏴
RGI_Emoji_ZWJ_Sequence: e.g. 🧑‍🌾
RGI_Emoji: union of all of the above sets
These properties are defined in text files. It’s interesting that the definitions are simply enumerations of code point sequences – for example:
0023 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: \x{23}
002A FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: *
0030 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 0
0031 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 1
0032 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 2
0033 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 3
0034 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 4
0035 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 5
0036 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 6
0037 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 7
0038 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 8
0039 FE0F 20E3 ; Emoji_Keycap_Sequence ; keycap: 9
There are plans to support Unicode properties of strings with the /u flag, so this feature may not remain exclusive to /v.
Negated character classes still only match code points:
> '😵💫'.replaceAll(/[^0-9]/gv, 'X')
'XXX'
Negating Unicode properties of strings is a syntax error.
This issue could be fixed if regular expressions treated input strings as sequences of graphemes. However, that would be a significant undertaking and is beyond the scope of the proposal.
To enable set operations for character classes, we must be able to nest them. With character class escapes, there already is nesting. Flag /v also lets us nest character classes. The following two regular expressions are equivalent:
> /^[\d\w]$/v.test('7')
true
> /^[\d\w]$/v.test('H')
true
> /^[\d\w]$/v.test('?')
false
> /^[[0-9][A-Za-z0-9_]]$/v.test('7')
true
> /^[[0-9][A-Za-z0-9_]]$/v.test('H')
true
> /^[[0-9][A-Za-z0-9_]]$/v.test('?')
false
--
We can use the -- operator to set-theoretically subtract the character sets defined by character classes or character class escapes:
> /^[\w--[a-g]]$/v.test('a')
false
> /^[\w--[a-g]]$/v.test('h')
true
> /^[\p{Number}--[0-9]]$/v.test('٣')
true
> /^[\p{Number}--[0-9]]$/v.test('3')
false
> /^[\p{RGI_Emoji_Flag_Sequence}--\q{🇩🇪}]$/v.test('🇳🇿')
true
> /^[\p{RGI_Emoji_Flag_Sequence}--\q{🇩🇪}]$/v.test('🇩🇪')
false
Single code points can also be used on either side of the -- operator:
> /^[\w--a]$/v.test('a')
false
> /^[\w--a]$/v.test('b')
true
&&
We can use the && operator to set-theoretically intersect the character sets defined by character classes or character class escapes:
> /[\p{ASCII}&&\p{Decimal_Number}]/v.test('4')
true
> /[\p{ASCII}&&\p{Decimal_Number}]/v.test('X')
false
> /^[\p{Script=Arabic}&&\p{Number}]$/v.test('٣')
true
> /^[\p{Script=Arabic}&&\p{Number}]$/v.test('ق')
false
To compute the set-theoretical union of character sets, we only need to write their defining constructs next to each other inside a character class:
> /^[\p{Emoji_Keycap_Sequence}[a-z]]+$/v.test('a3️⃣c')
true
Flag /u has a quirk when it comes to case-insensitive matching.
If a character class escape is negated, the complement of its character set is computed first and then /i is handled (via Unicode Simple_Case_Folding (SCF)) – for example:
> /^\P{Lowercase_Letter}$/iu.test('A')
true
> /^\P{Lowercase_Letter}$/iu.test('a')
true
The character set of \P{Lowercase_Letter} includes (among other code points) uppercase letters such as “A”. Under /i, “a” and “A” are treated as the same character, so the lowercase versions match as well, which explains the results in the previous example.
If a character class is negated, SCF is applied to the character class before its complement is computed – for example:
> /^[^\p{Lowercase_Letter}]$/iu.test('A')
false
> /^[^\p{Lowercase_Letter}]$/iu.test('a')
false
The character set of \p{Lowercase_Letter} contains all lowercase letters. After SCF, it also contains all uppercase letters. The complement of that set matches neither lowercase nor uppercase letters, which explains the results in the previous example.
For comparison, this is what happens without /i:
> /^\P{Lowercase_Letter}$/u.test('A')
true
> /^\P{Lowercase_Letter}$/u.test('a')
false
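The /u quirk becomes easier to see if we replace every match in a mixed string. The following sketch only covers flag /u; the input string is an arbitrary example:

```javascript
const input = 'aAbB4#';

// Without /i, both ways of negating behave the same:
console.log(input.replaceAll(/\P{Lowercase_Letter}/gu, 'X'));    // 'aXbXXX'
console.log(input.replaceAll(/[^\p{Lowercase_Letter}]/gu, 'X')); // 'aXbXXX'

// With /i, they differ:
// - The negated escape matches everything, including the letters.
// - The negated class matches none of the letters.
console.log(input.replaceAll(/\P{Lowercase_Letter}/giu, 'X'));    // 'XXXXXX'
console.log(input.replaceAll(/[^\p{Lowercase_Letter}]/giu, 'X')); // 'aAbBXX'
```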
Two observations:
Intuitively, \P{Lowercase_Letter} and [^\p{Lowercase_Letter}] should be equivalent, but with /iu they are not.
Intuitively, if we add /i to a regular expression, it should match at least as many strings as before (not fewer).
That’s why with flag /v, case folding (“deep case closure”) is performed after all character sets were computed (loosely similar to the first example in this section):
> /^\P{Lowercase_Letter}$/iv.test('A')
true
> /^\P{Lowercase_Letter}$/iv.test('a')
true
> /^[^\p{Lowercase_Letter}]$/iv.test('A')
true
> /^[^\p{Lowercase_Letter}]$/iv.test('a')
true
Source of this section: GitHub issue “IgnoreCase vs. complement vs. nested class”
Which characters must be escaped inside /v character classes?
Inside /u character classes, we must escape:
\ ]
Some characters only have to be escaped in some locations:
The hyphen - only has to be escaped if it doesn’t come first or last.
The caret ^ only has to be escaped if it comes first.
Inside /v character classes, we additionally must always escape:
Special characters:
( ) [ { } / - |
Double punctuators:
&& !! ## $$ %% ** ++ ,, .. :: ;; << == >> ?? @@ ^^ `` ~~
Consequences:
When escaping plain text for regular expressions, more characters must be escaped:
/ - & ! # % , : ; < = > @ ` ~
Alas, escaping these characters is currently illegal with flag /u. There are plans to change that, though.
@babel/plugin-proposal-unicode-sets-regex transpiles regular expressions with flag /v:
RegExp.prototype.unicodeSets.
Sources of this blog post (in addition to the proposal itself):
“RegExp v flag with set notation and properties of strings” by Mark Davis, Markus Scherer, and Mathias Bynens
More information on some of the topics covered in this blog post:
Chapter “Unicode – a brief introduction” in “JavaScript for impatient programmers”
Section “Flag: Unicode mode via /u” in “JavaScript for impatient programmers”
Section “Unicode property escapes” in “JavaScript for impatient programmers”
Useful resources: