Fun anecdote told by John Resig (check the post for more details and links):
One of the first implementations of the HTML 5 parsing rules was actually created to power the HTML 5 validator. [...] This particular implementation is in Java [...] Henri Sivonen (the author of the validator) just recently landed [...] a brand new HTML 5 parsing engine in Gecko, destined for the next version of Firefox. What’s interesting about this particular implementation is that it’s actually an automated conversion of Henri’s Java HTML 5 parser to C++. This conversion happens automatically and changes will be pushed upstream to the Mozilla codebase.

The WebKit blog has more on HTML5 parsing. It lists three main advantages:
We’ve been implementing the HTML5 parsing algorithm in phases. Two months ago, we finished the first phase, which consisted of the tokenization algorithm. Late last night, we finished the second major piece: the tree builder algorithm. Together, these two algorithms form the core of the parser and consist of over 10,000 lines of code. In the next phase, we’ll tackle fragment parsing (which is used by innerHTML and HTML5test.com).

This is one important argument against this kind of lenient parsing: an HTML parser becomes hard to implement, even though it is a piece of software with many uses (browsers, crawlers, screen scrapers, transformers, ...). I wonder whether we should introduce a strict mode for HTML5 for people who actually want to write clean, easily parsable HTML.
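
To make the strict-mode idea a bit more concrete, here is a minimal sketch in Python. It assumes that "strict" simply means "parses as well-formed XML". That definition and the helper name `is_strict` are assumptions made for illustration; they are not part of HTML5 or of any proposal mentioned above.

```python
import xml.etree.ElementTree as ET


def is_strict(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML.

    Hypothetical helper: "strict" here only means "well-formed XML",
    which is an assumption made for this sketch.
    """
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Valid HTML under the lenient rules (the </p> end tags are optional),
# but not well-formed XML because the elements are never closed.
lenient = "<p>one<p>two"

# Every element is explicitly closed and there is a single root element.
clean = "<div><p>one</p><p>two</p></div>"

print(is_strict(lenient))
print(is_strict(clean))
```

A check like this takes a handful of lines because all error handling amounts to rejecting the input. A lenient parser instead has to define a recovery behavior for every kind of malformed input, which is where most of the implementation effort goes.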