This blog post is a brief introduction to Unicode and how it is handled in JavaScript.
The first Unicode draft proposal was published in 1988. Work continued afterwards and the working group expanded. The Unicode Consortium was incorporated on January 3, 1991:
The Unicode Consortium is a non-profit corporation devoted to developing, maintaining, and promoting software internationalization standards and data, particularly the Unicode Standard [...]

The first volume of the Unicode 1.0 standard was published in October 1991, the second one in June 1992.
UTF-16 is a format with 16-bit code units that needs one or two units to represent a code point. BMP code points are represented by single code units. Higher code points fit into 20 bits after 0x10000 (the size of the BMP range) is subtracted. Those 20 bits are encoded as two code units:
function toUTF16(codePoint) {
    var TEN_BITS = 0b1111111111;
    function u(codeUnit) {
        return '\\u' + codeUnit.toString(16).toUpperCase();
    }
    if (codePoint <= 0xFFFF) {
        return u(codePoint);
    }
    codePoint -= 0x10000;

    // Shift right to get to most significant 10 bits
    var leadSurrogate = 0xD800 + (codePoint >> 10);

    // Mask to get least significant 10 bits
    var tailSurrogate = 0xDC00 + (codePoint & TEN_BITS);

    return u(leadSurrogate) + u(tailSurrogate);
}
UCS-2, a deprecated format, uses 16-bit code units to represent (only!) the code points of the BMP. When the range of Unicode code points expanded beyond 16 bits, UTF-16 replaced UCS-2.
UTF-8. UTF-8 has 8-bit code units. It builds a bridge between the legacy ASCII encoding and Unicode. ASCII has only 128 characters, whose numbers are the same as the first 128 Unicode code points. UTF-8 is backward compatible, because all ASCII characters are valid code units: a single code unit in the range 0–127 encodes a single code point in the same range. Such code units are marked by their highest bit being zero. If, on the other hand, the highest bit is one, then more units follow, providing the additional bits for higher code points. That leads to the following encoding scheme:

| Code point bits | Code unit 1 | Code unit 2 | Code unit 3 | Code unit 4 |
|---|---|---|---|---|
| 7 | 0xxxxxxx | | | |
| 11 | 110xxxxx | 10xxxxxx | | |
| 16 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 21 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
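As an illustration of this scheme, here is a minimal sketch (not part of any standard library; the function name toUTF8 is made up for this example) that encodes a single code point as an array of UTF-8 bytes:

// Sketch: encode one code point as an array of UTF-8 bytes,
// following the scheme in the table above.
function toUTF8(codePoint) {
    if (codePoint <= 0x7F) { // 7 bits → 0xxxxxxx
        return [codePoint];
    }
    if (codePoint <= 0x7FF) { // 11 bits → 110xxxxx 10xxxxxx
        return [0xC0 | (codePoint >> 6),
                0x80 | (codePoint & 0x3F)];
    }
    if (codePoint <= 0xFFFF) { // 16 bits → 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (codePoint >> 12),
                0x80 | ((codePoint >> 6) & 0x3F),
                0x80 | (codePoint & 0x3F)];
    }
    // 21 bits → 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (codePoint >> 18),
            0x80 | ((codePoint >> 12) & 0x3F),
            0x80 | ((codePoint >> 6) & 0x3F),
            0x80 | (codePoint & 0x3F)];
}

For example, toUTF8(0xF6) returns [0xC3, 0xB6], the two bytes of “ö” in UTF-8.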
UTF-8 has become the most popular Unicode format: initially due to its backward compatibility with ASCII, later due to its broad support across operating systems, programming environments, and applications.
ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16.

In identifiers, string literals and regular expression literals, any code unit can also be expressed via a Unicode escape sequence \uHHHH, where HHHH are four hexadecimal digits. For example:
> var f\u006F\u006F = 'abc';
> foo
'abc'
> var λ = 123;
> \u03BB
123
That means that you can use Unicode characters in literals and variable names, without leaving the ASCII range in the source code.
In string literals, an additional kind of escape is available: hex escape sequences with two-digit hexadecimal numbers that represent code units in the range 0x00–0xFF. For example:
> '\xF6' === 'ö'
true
> '\xF6' === '\u00F6'
true
Content-Type: application/javascript; charset=utf-8

Note: the correct media type (formerly known as MIME type) for JavaScript files is application/javascript. However, older browsers (e.g. Internet Explorer 8 and earlier) work most reliably with text/javascript. Unfortunately, the default value of the type attribute of <script> tags is text/javascript. At least you can omit that attribute for JavaScript; there is no benefit in including it.
<!doctype html>
<html>
<head>
<meta charset="UTF-8">
...
It is highly recommended to always specify an encoding. If you don’t, a locale-specific default encoding is used; that is, people in different countries will see the file interpreted differently. Only the lowest 7 bits are relatively stable across locales.
uglifyjs -b beautify=false,ascii-only=true test.js

The file test.js looks like this:
var σ = 'Köln';

The output of UglifyJS looks like this:
var \u03c3="K\xf6ln";

Negative example: For a while, the library D3.js was published in UTF-8. That caused an error when it was loaded from a page whose encoding was not UTF-8, because the code contained statements such as
var π = Math.PI, ε = 1e-6;

The identifiers π and ε were not decoded correctly and thus not recognized as valid variable names. Additionally, some string literals with code points beyond 7 bits weren’t decoded correctly, either. As a work-around, the code could be loaded by adding the appropriate charset attribute to the script tag:
<script charset="utf-8" src="d3.js"></script>
When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.

Escape sequences. As mentioned before, you can use Unicode escape sequences and hex escape sequences in string literals. For example, you can produce the character “ö” by combining an “o” with a combining diaeresis (code point 0x0308):
> console.log('o\u0308')
ö
This works in command lines, such as web browser consoles and the Node.js REPL in a terminal. You can also insert this kind of string into the DOM of a web page.
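As a minimal sketch of the DOM case (the element id output is a made-up placeholder, not from the original example):

// Assumes the page contains an element such as <div id="output"></div>
document.getElementById('output').textContent = 'o\u0308'; // displays “ö”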
Referring to astral plane characters via escapes. There are many nice Unicode symbol tables on the web. Take a look at Tim Whitlock’s “Emoji Unicode Tables” and be amazed by how many symbols there are in modern Unicode fonts. None of the symbols in the tables are images; they are all font glyphs. Let’s assume you want to use JavaScript to display a character that is in an astral plane. For example, a cow (code point 0x1F404):
🐄

You can either copy the character and paste it directly into your Unicode-encoded JavaScript source:
var str = '🐄';

JavaScript engines will decode the source (which is most often in UTF-8) and create a string with two UTF-16 code units. Alternatively, you can compute the two code units yourself and use Unicode escape sequences. There are web apps that perform this computation; the function toUTF16, defined earlier, performs it, too:
> toUTF16(0x1F404)
'\\uD83D\\uDC04'
The UTF-16 surrogate pair (0xD83D, 0xDC04) does indeed encode the cow:
> console.log('\uD83D\uDC04')
🐄
Counting characters. If a string contains a surrogate pair (two code units encoding a single code point), then the length property doesn’t count characters any more; it counts code units:
> var str = '🐄';
> str === '\uD83D\uDC04'
true
> str.length
2
This can be fixed via libraries, such as Mathias Bynens’ Punycode.js, which is bundled with Node.js:
> var puny = require('punycode');
> puny.ucs2.decode(str).length
1
Unicode normalization. If you want to search in strings or compare them, then you need to normalize them first, e.g. via the library unorm (by Bjarke Walling).
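A minimal sketch, assuming the unorm package has been installed via npm: without normalization, the precomposed “ö” (\u00F6) and the combining sequence “o\u0308” are different strings; after normalization to NFC, both have the same, precomposed form.

> var unorm = require('unorm');
> '\u00F6' === 'o\u0308'
false
> unorm.nfc('o\u0308') === '\u00F6'
true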
Line terminators influence regular expression matching and do have a Unicode definition. A line terminator is one of the following four characters (an example follows the table):
| Code unit | Name | Character escape sequence |
|---|---|---|
| \u000A | Line feed | \n |
| \u000D | Carriage return | \r |
| \u2028 | Line separator | |
| \u2029 | Paragraph separator | |
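For example, the dot matches any character except line terminators, which includes the two Unicode-only terminators (a quick check in a REPL):

> /./.test('a')
true
> /./.test('\n')
false
> /./.test('\u2028')
false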
The following regular expression constructs support Unicode:
> /^\s$/.test('\uFEFF')
true
> /\bb/.test('über')
true
To match any code point (as opposed to any code unit), you can use the following pattern:

([\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])

The above pattern works like this:
([BMP code point]|[lead surrogate][tail surrogate])

As all of these ranges are disjoint, the pattern will correctly match code points in well-formed UTF-16 strings.
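As a sketch of how this pattern can be put to work, the following REPL interaction counts the cow string from earlier by code points rather than code units (the variable name codePointRe is just for illustration):

> var codePointRe = /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
> '\uD83D\uDC04'.match(codePointRe).length
1
> '\uD83D\uDC04'.length
2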
XRegExp is a regular expression library that has an official addon for matching Unicode categories, scripts, blocks and properties via one of the following three constructs:
\p{...} \P{...} \p{^...}
For example, \p{Letter} matches letters in various alphabets.
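A brief sketch, assuming XRegExp and its Unicode addon have been loaded (the variable name onlyLetters is just for illustration):

> var onlyLetters = XRegExp('^\\p{Letter}+$');
> onlyLetters.test('über')
true
> onlyLetters.test('déjà vu')
false

The second test fails because the space is not a letter.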