I’m going to make a bold prediction. Long after you and I are gone, HTML will still be around. Not just in billions of archived pages from our era, but as a living, breathing entity. Too much effort, energy, and investment has gone into developing the web’s tools, protocols, and platforms for it to be abandoned lightly, if indeed at all.
Let’s stop to consider our responsibility. By an accident of history, we are associated with the development of an important tool our civilization will use to communicate for decades to come. So, when we turn our minds, idly or in earnest, to improving HTML, we must understand just how far-reaching the ramifications of today’s decisions may be.
HTML 5, the W3C’s recently redoubled effort to shape the next generation of HTML, has, over the last year or so, taken on considerable momentum. It is an enormous project, covering not simply the structure of HTML, but also parsing models, error-handling models, the DOM, algorithms for resource fetching, media content, 2D drawing, data templating, security models, page loading models, client-side data storage, and more.
There are also revisions to the structure, syntax, and semantics of HTML, some of which Lachlan Hunt covered in “A Preview of HTML 5.”
But for this article, let’s turn solely to the semantics of HTML. It’s something I’ve been interested in for many years, and something which I believe is fundamentally important to the future of HTML.
The BBC recently announced that they would drop the hCalendar microformat from their program listings, due to accessibility and usability concerns with the abbr design pattern. This demonstrates that we have, beyond any doubt, pushed the semantic capability of HTML far past what was ever intended, and indeed, what is reasonably possible with the language. We have simply run out of HTML elements and attributes with which to mark up more richly semantic documents. If we continue to be clever with the existing constructs of HTML, more problems such as this will arise. But HTML suffers from a fundamental defect as a semantic markup language—its semantics are fixed, not extensible.
This is not simply a theoretical problem. Hundreds of thousands of developers use the class
and id
attributes of HTML to create more richly semantic markup. (They also use them as “hooks” for CSS styling, but that’s another matter.) Almost invariably, those developers use ad hoc vocabularies—that is, values they have made up, rather than values taken from existing schemas. It’s pseudo semantic markup at best.
Many pages around the web use microformats to add more structured semantics than available in HTML’s impoverished set of elements and attributes. In this case, the values used for the class
attribute come from agreed-upon vocabularies, sometimes adopted from other standards, such as vCard, sometimes from newly minted vocabularies where no solid pre-existing standard exists (as is the case for hReview).
Extensible semantics
There is a very real problem that needs to be solved here. We need mechanisms in HTML that clearly and unambiguously enable developers to add richer, more meaningful semantics—not pseudo semantics—to their markup. This is perhaps the single most pressing goal for the HTML 5 project.
But it’s not as simple as coming up with a mechanism to create richer semantics in HTML content: there are significant constraints on any solution. Perhaps the biggest one is backward compatibility. The solution can’t break the hundreds of millions of browsing devices in use today, which will continue to be used for years to come. Any solution that isn’t backward compatible won’t be widely adopted by developers for fear of excluding readers. It will quickly wither on the vine.
The solution must be forward compatible as well. Not in the sense that it must work in future browsers—that’s the responsibility of browser developers—but it must be extensible. We can’t expect any single solution we develop right now to solve all imaginable and unimaginable future semantic needs. We can develop a solution that can be extended to help meet future needs as they arise.
These two constraints in tandem, present a huge challenge. But in the context of a language whose major iterations arrive a decade apart, and whose importance as a global platform for communication is paramount, this is a challenge that must be solved.
So, how is HTML 5 addressing this issue? HTML 5 introduces a number of new elements. Some of these are what I’ve termed “structural”—section
, nav
, aside
, header
, and footer
. The dialog
element is a kind of content element, akin to blockquote
. There are also a number of data elements, such as meter
, which “represents a scalar measurement within a known range, or a fractional value; for example disk usage,” and the time
element, which represents a date and/or a time.
While these elements might be useful, and seem to have generated some interest, do they really solve the problem we’ve identified, particularly within the twin constraints of forward and backward compatibility?
Let’s consider each constraint.
Backward compatibility
How do current browsers handle these new elements, such as section
? Well, the most recent versions of Safari, Opera, Mozilla, and even IE7 will all render a page as follows.
<h1>Top Level Heading</h1> <section> <h1>Second Level Heading</h1> <p>this is text in a section element</p> <section> <h1>Third Level Heading</h1> </section> </section>
It looks like an excellent start. But when we try styling, for example, section elements with CSS that looks like this:
section {color: red}
…most of the above-mentioned browsers manage to style the element, but IE7 (and so presumably 6) do not.
So we have a serious backward compatibility issue with 75% of browsers currently in use. Given the half-life of Internet Explorer, we can predict that most users will be using IE6 or IE7 even several years from now.
If HTML 5 introduces these new elements, what is the likelihood they’ll be implemented by the vast majority of developers—given the knowledge that they’re essentially incompatible with the majority of browsers in use?
Unfortunately, if you are looking for alternative solutions to the CSS problem, putting class
attributes on your section
elements and then trying to style them using the class value won’t work in IE. Perhaps there is some kind of workaround out there, but unless there is, that looks like a deal breaker right there.
Let’s turn to forward compatibility, the second constraint.
Forward compatibility
We’ll start by posing the question: “why are we inventing these new elements?” A reasonable answer would be: “because HTML lacks semantic richness, and by adding these elements, we increase the semantic richness of HTML—that can’t be bad, can it?”
By adding these elements, we are addressing the need for greater semantic capability in HTML, but only within a narrow scope. No matter how many elements we bolt on, we will always think of more semantic goodness to add to HTML. And so, having added as many new elements as we like, we still won’t have solved the problem. We don’t need to add specific terms to the vocabulary of HTML, we need to add a mechanism that allows semantic richness to be added to a document as required. In technical terms, we need to make HTML extensible. HTML 5 proposes no mechanism for extensibility.
HTML 5, therefore, implements a feature that breaks a sizable percentage of current browsers, and doesn’t really allow us to add richer semantics to the language at all.
Several questions remain about the new elements. Where have these new element names come from? How was it decided that there should be a navigation element, and that it should be called “nav”? Why should the same term apply to page-level, site-level, and meta-site-level navigation?
Why not adopt an existing vocabulary, such as Docbook? Its document structure vocabulary is far richer and it’s been developed by publishing experts over many years. This is not an argument in favor of Docbook, specifically: the point is that the extremely important task of providing a mechanism for semantic richness in HTML is being approached in an ad hoc way, paying apparently little attention to best practices in related work going back 30 years or more. (The original work on GML began in the early 1970s.)
Some thoughts on a solution
So, having been critical of current efforts, do I have any practical suggestions on how to solve this problem? Well, I have the start of one.
If adding elements to HTML is out of the question, at least within the parameters of this discussion, attributes are the other logical area of HTML to concentrate on. After all, for nearly a decade, we’ve been using class
and id
attributes as mechanisms to extend the semantics of HTML. A great many developers are familiar and comfortable with this. The microformats project demonstrated that the existing attributes of HTML are not sufficient, as a generalized mechanism, to extend the semantics of HTML. So, if we are to use attributes to help solve this problem, we need to come up with one or more new attributes. Before we get into the mechanics of how that might work, it’s only fair to subject this suggestion to the same requirements we have for the new elements of HTML 5. Most importantly, is introducing new attributes to HTML backward compatible? And if so, does it provide a workable mechanism for semantic extensibility in HTML?
Let’s invent a new attribute. I’ll call it “structure,” but the particular name isn’t important. We can use it like this:
<div structure="header">
Let’s see how our browsers fare with this.
Of course, all our browsers will style this element with CSS.
div {color: red}
But how about this?
div[structure] {font-weight: bold}
In fact, almost all browsers, including IE7, style the div
with an attribute of structure
, even if there is no such thing as the structure
attribute! Sadly, our luck runs out there, as IE6 does not. But we can use the attribute in HTML and have all existing browsers recognize it. We can even use CSS to style our HTML using the attribute in all modern browsers. And, if we want a workaround for older browsers, we can add a class
value to the element for styling. Compare this with the HTML 5 solution, which adds new elements that cannot be styled in Internet Explorer 6 or 7 and you’ll see that this is definitely a more backward-compatible solution.
Extensibility through attributes
Instead of new elements, HTML 5 should adopt a number of new attributes. Each of these attributes would relate to a category or type of semantics. For example, as I’ve detailed in another article, HTML includes structural semantics, rhetorical semantics, role semantics (adopted from XHTML), and other classes or categories of semantics.
These new attributes could then be used much as the class
attribute is used: to attach to an element semantics that describe the nature of the element, or to add metadata about the element.
This is not dissimilar to the role attribute of XHTML, but rather than having a single attribute “bucket” for all element semantics, we should identify the different types of semantics for an element, and separate them out.
For example, the XHTML role
attribute works like this:
<ul role="navigation sitemap"> <li href="downloads">Downloads</li> <li href="docs">Documentation</li> <li href="news">News</li> </ul>
The values of the role
attribute are a space-separated list of words from the default vocabulary, or from a defined vocabulary.
Why not simply adopt the role
attribute as-is? Well, there are other kinds of semantics for which the term role doesn’t apply. For example:
<p rhetoric="irony">He’s a fantastic person.</p>
This demonstrates a theoretical type of semantics—“rhetoric,” which could be used to markup the rhetorical nature of a document. This element clearly doesn’t play the role of irony in the document. Rather, the contents of the element are ironic.
Here is another example. It’s increasingly obvious that HTML lacks a way to attach a machine readable version of a humanly readable value, e.g., a date. This is at the heart of the problem the BBC has with the hCalendar microformat that we referred to earlier. While <span role="2009-05-01">May Day next year</span>
really doesn’t make sense, something along the lines of <span equivalent="2009-05-01">May Day next year</span>
would.
Again, whether we use the specific term “equivalent” or some other term for this kind of semantic attribute is not the issue. What’s important to note is that it’s not as simple as using either the class
attribute or the role
attribute as a one-size-fits-all bucket to hold semantic information. For a properly extensible solution that provides backward compatibility and sufficient flexibility, a solution along these lines looks worth investigating.
I titled this section “some thoughts on a solution” because a significant amount of work needs to be done to really develop a workable solution. Open questions include the following.
- How many distinct semantic attributes should there be? Should these categories be extensible, and if so, how?
- How are vocabularies determined?
- Do we simply invent the terms we want, in much the same way that developers have been using
class
values, or should the possible values all be determined by a standardized specification? Or should there be a mechanism for inventing (and hopefully sharing) vocabularies, using some kind of profile? - If we have a conflict between two vocabularies, such that two identical terms are defined by two different vocabularies, how is this resolved?
- Do we need a form of name spacing, or does some other mechanism exist?
Rather than rushing to answer these questions, I’m posing them to highlight the issues that need to be addressed, and to start a dialog. The ramifications and reach of decisions made in HTML 5 are too great for decisions to be made in the absence of at least some input from those highly knowledgeable about linguistics, semantics, semiotics, and related fields.
Hopefully, if nothing else, then it’s clear that simply “making up new elements” isn’t a solution to how to increase the semantic capacity of HTML.
Let’s not rush into these decisions lightly—after all, with climate change we’ve saddled our grandkids with enough trouble as it is. Let’s at least leave them the best possible HTML we can.