Thinking Out Loud

On Monday I attended Mark Birbeck‘s seminar The Possibilities of RDFa and the Semantic Web, an ‘in-the-brain’ session on RDFa and structured data discussing some of the exciting possibilities created by authors implementing structured data via RDFa into Web pages.

We looked at some existing implementations, how to enhance very simple Web pages to be machine-’readable’ and looked at RDFa as part of the bigger picture – in correspondence to RDF, Microformats and other technologies, and the Semantic Web.

RDFa (Resource Description Framework in attributes) is a set of extensions to XHTML, a set of attributes intended to be implemented by Web page publishers to introduce extra ‘structure’ to the information in documents. These attributes are added at author-time and are completely invisible to the end-user. RDFa is essentially metadata, additional machine-readable indicators that describe the information they annotate without intruding on existing mark-up.

The W3C’s RDFa Primer is a good introduction for those unfamiliar with the technology and it conveys the current problem of the separation between the information on the Web readable by humans and that ‘readable’ by machine – the data Web.

Today’s Web is built predominantly for human consumption. Even as machine-readable data begins to appear on the Web it is typically distributed in a separate file, with a separate format, and very limited correspondence between the human and machine versions. As a result, Web browsers can provide only minimal assistance to humans in parsing and processing data: browsers only see presentation information.

On a typical Web page, an XHTML author might specify a headline, then a smaller sub-headline, a block of italicised text, a few paragraphs of average-size text, and, finally, a few single-word links. Web browsers will follow these presentation instructions faithfully. However, only the human mind understands that the headline is, in fact, the blog post title, the sub-headline indicates the author, the italicised text is the article’s publication date, and the single-word links are categorisation labels.

Presentation vs Semantics

The gap between what programs and humans understand is large. But when Web data meant for humans is augmented with machine-readable hints meant for the computer programs that look for them, these programs become significantly more helpful because they can begin to understand the data’s structure and meaning.

RDFa is getting a lot of attention lately, not least with developments of HTML5 and the consideration of its implementation with the specification. As of October 2008, RDFa in XHTML is a W3C Recommendation, but it was Mark Birkbeck who was first to propose RDFa in a note to the W3C in February 2004 – so I was looking forward to this.

Publishing information and publishing data

Mark started with a modern use-case for RDFa, showing us how we can and why we should put structured data in our sites.

He pointed out that online publishing has never been easier. There are a huge number of blogging services offering free, easy to set-up and maintain platforms that are long established and now with sites like Twitter offering the ability to update via SMS or from a mobile device, the Web is enabling a level of ubiquitous publishing that we’re now completely familiar with.

This is us publishing information, human-readable content. But he says that publishing data, however, still remains hard.

By data, he refers to information ‘without prose’ – think listings on eBay or Yelp, where numbers and stats are what matter over articles of text. He suggests that places like eBay and Yelp are the only places where publishing data is easy. But by using these sites, especially those where the majority of content is user-generated, raises questions as to who actually owns the data.

As a result, people have begun to make their own sites, for maintain their own data. That sounds easy enough, but the problem is that the most accessible means to make websites nowadays is to use the blogging platforms that I mentioned earlier – those that aren’t really meant for publishing data, they’re meant for publishing information.

Mark showed some examples where this is happening, WordPress and Blogger type sites set up by individuals for reviewing films, or books, or restaurants etc – where each blog ‘post’ was an entry for an individual film or book, and details like scores and ratings (the valuable data) is lumped together in the article body – because that’s all it can really do – it’s meant for prose.

Of course, that’s fine, for their own sake. They have their content on the Web, on a domain they control. But often this content is of real value, its useful for more than just the author and other people should be see it.

Often though without being on sites such as eBay or Yelp, users will infrequently find them. This is because the norm for finding content is by way of a search engine, i.e. using a machine to find human information. True, some independent sites do rank highly in search engines, but for any newly launched site it is near impossible.

So enter RDFa. By using RDFa, authors can package their existing content in additional mark-up so machines can discriminate between what parts of a long body of text refers specifically to valuable discreet details. Authors can target and isolate individual pieces of text and associate these with a predefined vocabulary of terms specifically for the relevant context of the subject data.

An example

The following is Mark’s example blog post, some typical mark-up for a book review:

<div>
<span>
Title: <span>Chanda’s Secrets</span>
</span>

Stars: <span>*****</span>

<span>
I reviewed Stratton’s newest teen novel, Leslie’s Journal in October.
I’d heard about Chanda’s Secrets and wanted to give it a try…
</span>

</div>

This is very simple mark-up, perfectly understandable for human consumption, but offers machines agents zero indication as to what the content refers to.

RDF is Resource Description Framework, it’s purpose is to describe. To do so, it references vocabularies of terms – definitions that provide agreed-upon tags for developers to use that indicate unambiguous details. Machine applications finding this data can then interpret these as they choose, matching the unique identifying terms to the same vocabulary in their process as reference to exactly what was intended, to infer meaning.

There is a vocabulary for book reviews, for example. The following code takes the mark-up above and adds RDFa, enriching the data for the machines ‘viewpoint’, while having no impact on its presentation – the human viewpoint:

<div xmlns:v=”http://rdf.data-vocabulary.org/#” typeof=”v:Review”>
<span rel=”v:itemreviewed”>
Title: <span property=”v:name”>Chanda’s Secrets</span>
</span>

Stars: <span property=”v:rating” content=”5″>*****</span>

<span property=”v:summary”>
I reviewed Stratton’s newest teen novel, Leslie’s Journal in October.
I’d heard about Chanda’s Secrets and wanted to give it a try…
</span>

</div>

A little run through of the code – the first line adds a reference to the RDFa vocabulary to use, defining a XML namespace, v, and declares that everything within this div is a review (type v:Review). We add v:itemreviewed to the title and specify its v:name explicitly. This review class has a rating property, so in the same way we add the v:rating property to the five stars. Here you can see another example of a lack of machine/human understanding – we can see five asterisks (they aren’t even five stars) and work out what they refer to – but a machine has no idea. So we ‘overwrite’ the inline text with the content attribute, declaring outright that our rating has a value of 5. The summary is tagged in the same way then.

Now it’s not like this means the post would jump straight to the top of Google’s search results. RDFa isn’t about solving that problem, but it helps. It allows machines to infer meaning and act accordingly, if the application is capable of doing so.

For example, any search engine could probably find this post as by matching the search term ‘Chanda’s Secrets’ in a standard lookup query. But recently Google added RDFa support and Yahoo! have been doing it for over a year, and both sites now have specific applications for parsing RDFa on top of their normal workings.

Google’s is called Rich Snippets, which returns this kind of result:

A Google 'Rich Snippet'

It’s an enhanced search result that can feature context-relevant details alongside the link, in this case a star-rating and price range indicator. It’s with this kind of application you can begin to see how adding structured data can make powerful improvements to your site.

Writing a review and having everything lumped together in the body of an article is fine for humans reading the content, they can make sense of the words, numbers and pictures themselves, but a machine can’t. By annotating your review – indicating, for example, a restaurant rating or the price of your food – you make your data machine-processable, so also able to be aggregated with others. This is called data distribution. In this way, you also still ‘own’ your content because it originates from your site, but it is as computable as those larger sites and databases.

More possibilities

As well as being used as a means to structure data in a desired format, RDFa can also be used to standardise data, again for interoperability and processing.

Mark presented some job aggregation sites that demonstrate this. Each site was made with a different technology (ASP .NET, PHP, static pages – whatever) and each displayed data in a different way, not only in their presentation but also by using different terminology – a job ‘title’ or ‘name’ or ‘position’, for example. RDFa indicators standardise these ambiguities, so third-party sites, such as aggregators, can intelligently collate the data with an understanding of the differences.

He also spoke about joining the linked data cloud, showing sites that could use the unique identifiers of a vocabulary with reference to their own content for an opportunity to enrich their pages. He had a nifty example that populated a book review page by using the relevant class and retrieving the book’s cover from a external source that had tagged their data in the same way. That match ensured the discovered data would be relevant.

Finally he talked about vertical search. Where RDFa, at least at it’s most rudimentary, can be used to work with ambiguous words, synonyms and such. A search for a chemical symbol, in traditional lookup search, would not take into account it’s chemical name or any other term that refers correctly to it, albeit by a different word. With an application that converts said terms to an agreed upon identifier, for example it’s chemical formula, users could retrieve all references, by whatever name, tagged also with that identifier.

Q&A

The Q&A to close the seminar was also really useful. Here Mark talked about Microformats, in comparison to RDFa, something I was waiting for. He pointed out that Microformats each require their own parser, need references to a separate vocabulary for each microformat used and commented on the standard being centrally maintained by a handful of people. RDF is decentralised, there is no governing body – that anyone can create a vocabulary (though of course that might not always be the best answer), and really made Microformats seem clunky and second-best to RDFa – an opinion I hadn’t held before.

Mark also pointed out that RDFa is arguably more extensible in that can relate to references not on the actual page of implementation. For example on a page where a license is being specified, with RDFa and by using the about tag, users can refer to objects linking from their page – such as listing different licenses for a set of external images or videos, rather than only being able to describe the license for the page itself, the page that these images link from.

Eventually there was some more Semantic Web talk, discussing technologies such as RDF, OWL and ontologies – but not as much as I’d hoped.

In fact, a lot of the content of the talk can be found in Mark’s excellent articles written for A List Apart, his Introduction to RDFa and Introduction to RDFa, Part II.

I highly recommend reading these, he explores the different methods of annotating your pages, how to apply attributes to various kinds of elements, discusses vocabularies, rules and how to express relations – a great introduction.

Overall, a great talk. He’s also now uploaded his slides, with a video recording of the session pending.

Here they are:

Latest Images

Trending Articles

Latest Images