In this blog post, we explore ways in which we can make regular expressions easier to use.
We’ll use the following regular expression as an example:
const RE_API_SIGNATURE =
/^(new |get )?([A-Za-z0-9_.\[\]]+)/;
Right now, it is still fairly cryptic. It will be much easier to understand once we get to the tip about “insignificant whitespace”.
/v
If we add flag /v to our regular expression, we get fewer quirks and more features:
const RE_API_SIGNATURE =
/^(new |get )?([A-Za-z0-9_.\[\]]+)/v;
/v doesn’t change anything in this particular case, but it helps us if we add grapheme clusters with more than one code point or if we want features such as set operations in character classes.
If there is more than one flag, we should order the flags alphabetically – e.g.:
/pattern/giv
That makes ordering consistent and is also how JavaScript displays regular expressions:
> String(/pattern/vgi)
'/pattern/giv'
Our regular expression contains two positional capture groups. If we name them, they describe their purposes and we need less external documentation:
const RE_API_SIGNATURE =
/^(?<prefix>new |get )?(?<name>[A-Za-z0-9_.\[\]]+)/v;
#
So far, the regular expression is still fairly hard to read. We can change that by adding spaces and line breaks. Since the built-in regular expression literals don’t allow us to do that, we use the library Regex+ which provides us with the template tag regex:
import {regex} from 'regex';
const RE_API_SIGNATURE = regex`
^
(?<prefix>
new \x20 # constructor
|
get \x20 # getter
)?
(?<name>
# Square brackets are needed for symbol keys
[
A-Z a-z 0-9 _
.
\[ \]
]+
)
`;
That’s much easier to read, right?
The feature of ignoring whitespace in regular expression patterns is called insignificant whitespace. Additionally, we used a feature called inline comments – which are started by hash symbols (#).
Two observations:
- Since literal spaces are ignored, we use \x20 to express that there is a space after new and after get.
- In the future, JavaScript may get built-in support for insignificant whitespace via a flag /x (ECMAScript proposal).
With the regex template tag, the following flags are always active:
- /v
- /x (emulated) enables insignificant whitespace and line comments via #.
- /n (emulated) enables named capture only mode, which prevents numbered groups from capturing. In other words: (pattern) is treated like (?:pattern).

To make sure that a regular expression works as intended, we can write tests for it. These are tests for RE_API_SIGNATURE:
assert.deepEqual(
getCaptures(`get Map.prototype.size`),
{
prefix: 'get ',
name: 'Map.prototype.size',
}
);
assert.deepEqual(
getCaptures(`new Array(len = 0)`),
{
prefix: 'new ',
name: 'Array',
}
);
assert.deepEqual(
getCaptures(`Array.prototype.push(...items)`),
{
prefix: undefined,
name: 'Array.prototype.push',
}
);
assert.deepEqual(
getCaptures(`Map.prototype[Symbol.iterator]()`),
{
prefix: undefined,
name: 'Map.prototype[Symbol.iterator]',
}
);
function getCaptures(apiSignature) {
const match = RE_API_SIGNATURE.exec(apiSignature);
// Spread so that the result does not have a null prototype
// and is easier to compare.
return {...match.groups};
}
Seeing strings that match helps with understanding what a regular expression is supposed to do:
/**
* Matches API signatures – e.g.:
* ```
* `get Map.prototype.size`
* `new Array(len = 0)`
* `Array.prototype.push(...items)`
* `Map.prototype[Symbol.iterator]()`
* ```
*/
const RE_API_SIGNATURE = regex`
···
`;
Some documentation tools let us refer to unit tests in doc comments and show their code in the documentation. That’s a good alternative to what we have done above.
The Regex+ library lets us interpolate regular expression fragments (“patterns”), which helps with reuse. The following example defines a simple markup syntax that is reminiscent of HTML:
import { pattern, regex } from 'regex';
const LABEL = pattern`[a-z\-]+`;
const ARGS = pattern`
(?<args>
\x20+
${LABEL}
)*
`;
const NAME = pattern`
(?<name> ${LABEL} )
`;
const TAG = regex`
(?<openingTag>
\[
\x20*
${NAME}
${ARGS}
\x20*
\]
)
|
(?<singletonTag>
\[
\x20*
${NAME}
${ARGS}
\x20*
/ \]
)
`;
assert.deepEqual(
TAG.exec('[pre js line-numbers]').groups,
{
openingTag: '[pre js line-numbers]',
name: 'pre',
args: ' line-numbers',
singletonTag: undefined,
__proto__: null,
}
);
assert.deepEqual(
TAG.exec('[hr /]').groups,
{
openingTag: undefined,
name: 'hr',
args: undefined,
singletonTag: '[hr /]',
__proto__: null,
}
);
The regular expression TAG uses the regular expression fragments NAME and ARGS twice – which reduces redundancy.
With the following trick, we don’t need a library to write a regular expression with insignificant whitespace:
const RE_API_SIGNATURE = new RegExp(
String.raw`
^
(?<prefix>
new \x20
|
get \x20
)?
(?<name>
[
A-Z a-z 0-9 _
.
\[ \]
]+
)
`.replaceAll(/\s+/g, ''), // (A)
'v'
);
assert.equal(
String(RE_API_SIGNATURE),
String.raw`/^(?<prefix>new\x20|get\x20)?(?<name>[A-Za-z0-9_.\[\]]+)/v`
);
How does this code work?
- String.raw means we don’t have to escape regular expression backslashes inside the literal. Due to the backticks (vs. single straight quotes or double straight quotes), the regular expression can span multiple lines.
- .replaceAll() removes all whitespace (spaces, tabs, line breaks, etc.) so that the end result looks almost like the initial version of the regular expression. There is one difference, though: Since literal spaces are removed, we have to find a different way to specify that there is a space after new and after get. One option is the hex escape \x20: hexadecimal 20 (decimal 32) is the code point SPACE.

We can even emulate inline comments like this:
// Template tag function
const cmt = () => '';
const RE = new RegExp(
String.raw`
a+ ${cmt`one or more as`}
`.replaceAll(/\s+/g, ''),
'v'
);
assert.equal(
String(RE), '/a+/v'
);
Alas, it’s more syntactically noisy than I’d like.
One reason why many people don’t like regular expressions is that they find them difficult to read. However, that is much less of a problem with insignificant whitespace and comments. I’d argue that is the proper way of writing regular expressions: Think what JavaScript code would look like if we had to write it without whitespace and comments.
This blog post is a section in the chapter on regular expressions in my book “Exploring JavaScript” – which is free to read online.