Personanondata: Yahoo (and Now Google) and The Semantic Web

Friday, February 25, 2011

Yahoo (and Now Google) and The Semantic Web

News from Google about their use of microformats reminded me of something Yahoo announced over two years ago. First, here is a snippet from the Google announcement:

That’s a tough problem with the current web, according to Google’s Jack Menzel, the company’s product management director for search, despite the apparent ease that Watson had besting human Jeopardy opponents.

“We are still grasping for the Holy Grail of natural language search,” Menzel said. “We take the approach that the internet exists, and it is so big and Wild West-like that you have to take it for what it is. It is this giant immutable thing that will do its own thing, despite what you want it to do.”

The dream of a structured web has proven nearly impossible to create in practice as it requires coordination on building specs and then that web page builders take the time to mark their pages up in complicated XML. A more grassroots effort, known as Microformats, has had more success by focusing on just a few kinds of data and making innovative use of HTML, the lingua franca of the web, to simplify publishing meta-data. Google introduced its own suggestions of how websites could start publishing Google-friendly meta-data in 2009 (such as how many stars a rating is), with its so-called Rich Snippets.

And now for the first time, a mainstream search engine is built entirely around webpages that use microformats and other structured data.

So for instance, Google is able to show a searcher only Pho recipes that use tofu that take less than a half an hour to make, not by searching for pages that include the word “Pho” and “Tofu” and “Recipe”, but by actually knowing that a recipe for something called “Pho” has an ingredient “Tofu” and a listed cooking time of 1 hour (for example, this is done by the publisher wrapping the words “1 Hour” in a defined HTML tag, with Google then interpreting that in the search results).
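To make the mechanics a little more concrete, here is a rough sketch in Python of what "knowing" rather than "searching" might look like. It is only an illustration: the sample page is made up, the class names (fn, ingredient, duration) are borrowed from the hRecipe draft rather than from anything Google has published, and the parsing is far simpler than what a real search engine would do.

```python
from html.parser import HTMLParser

# Hypothetical publisher page. The class names (fn, ingredient, duration)
# follow the hRecipe draft and are assumptions, not Google's actual schema.
SAMPLE_PAGE = """
<div class="hrecipe">
  <h1 class="fn">Tofu Pho</h1>
  <ul>
    <li class="ingredient">Tofu</li>
    <li class="ingredient">Rice noodles</li>
  </ul>
  <p>Ready in <span class="duration">25 minutes</span>.</p>
</div>
"""

FIELDS = ("fn", "ingredient", "duration")


class RecipeParser(HTMLParser):
    """Collects the text of elements tagged with one of FIELDS.

    Assumes well-formed markup without void tags such as <br>.
    """

    def __init__(self):
        super().__init__()
        self.recipe = {field: [] for field in FIELDS}
        self._open = []  # one entry per open tag: a field name or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        field = next((c for c in classes if c in FIELDS), None)
        self._open.append(field)
        if field:
            self.recipe[field].append("")

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        # Only text directly inside a marked-up element is kept.
        if self._open and self._open[-1]:
            self.recipe[self._open[-1]][-1] += data


def parse_recipe(html):
    parser = RecipeParser()
    parser.feed(html)
    parser.close()
    return {k: [v.strip() for v in vals] for k, vals in parser.recipe.items()}


def minutes(text):
    # Assumes durations are written as "<number> minutes".
    return int(text.split()[0])


def matches(recipe, ingredient, max_minutes):
    """True if the recipe lists the ingredient and fits the time limit."""
    has_ingredient = any(ingredient.lower() == i.lower()
                         for i in recipe["ingredient"])
    quick = any(minutes(d) <= max_minutes for d in recipe["duration"])
    return has_ingredient and quick


recipe = parse_recipe(SAMPLE_PAGE)
print(recipe["fn"], matches(recipe, "tofu", 30))
```

The point of the sketch is that the filter works on named fields (an ingredient, a duration) instead of on whichever words happen to appear somewhere on the page.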

Here is the repost from March 14, 2008 (and some of what I comment on still applies):

In their continued strategic realignment and adoption of open standards, Yahoo has announced they are supporting a number of semantic web standards that will enable third parties (publishers) to augment and enhance native Yahoo search results. From TechCrunch:

A few details are being disclosed now, and Yahoo promises more in a few weeks. They are saying that they will support a number of microformats at the start: hCard, hCalendar, hReview, hAtom and XFN. They will support vocabulary components from Dublin Core, Creative Commons, FOAF, GeoRSS, MediaRSS, and others. They will support RDFa and eRDF markup to embed these into existing HTML pages. Finally, Yahoo will support the Amazon A9 OpenSearch specification with extensions for structured queries to deep web data.

There is a lot to get excited about in the Yahoo announcement(s) - and, judging by the comments on the TechCrunch post, others are interested as well - but perhaps the best thing to consider is that the weight of Yahoo will speed adoption of some of these standards. In particular, microformats, if adopted by publishers, could change the rules of content syndication and lead to far wider distribution of publisher content. This in turn would lead to higher pass-through traffic, generating product or advertising sales for publishers.

I only became aware of microformats in the past six months or so, but the concept derives from a practical problem. How often - like me - have you been frustrated by the need to copy down an address, the details of a book review, resume details, or even a cooking recipe? Well, microformats can standardize the manner in which these items are published to the web, enabling users like me to access and use this content as uniform packages of information or data independent of the publisher. For example, if I wanted to create a list of recipes from ten different cookbooks, I could assemble these and they would all appear together in consistent form. That would be a huge practical improvement on copy and paste.
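As a rough illustration of the "independent of the publisher" idea, the Python sketch below reads two made-up pages that lay out their recipes very differently (one with a div, one with a table) but share the same class names. The pages, the class names (taken from the hRecipe draft) and the output format are all assumptions for the sake of the example.

```python
from html.parser import HTMLParser

# Two hypothetical publishers: different layouts, same class names.
PAGES = [
    """<div class="hrecipe">
         <h2 class="fn">Tofu Pho</h2>
         <span class="ingredient">Tofu</span>
         <span class="ingredient">Rice noodles</span>
       </div>""",
    """<table class="hrecipe">
         <tr><td class="fn">Miso Soup</td></tr>
         <tr><td class="ingredient">Miso paste</td>
             <td class="ingredient">Tofu</td></tr>
       </table>""",
]


class RecipeCollector(HTMLParser):
    """Picks out the recipe name and ingredients by class name only.

    Assumes well-formed markup without void tags such as <br>.
    """

    def __init__(self):
        super().__init__()
        self.name = ""
        self.ingredients = []
        self._open = []  # field (or None) for each currently open tag

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "fn" in classes:
            self._open.append("fn")
        elif "ingredient" in classes:
            self._open.append("ingredient")
            self.ingredients.append("")
        else:
            self._open.append(None)

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        current = self._open[-1] if self._open else None
        if current == "fn":
            self.name += data
        elif current == "ingredient":
            self.ingredients[-1] += data


def to_record(html):
    """Turn one publisher page into a uniform recipe record."""
    collector = RecipeCollector()
    collector.feed(html)
    collector.close()
    return {
        "name": collector.name.strip(),
        "ingredients": [i.strip() for i in collector.ingredients],
    }


records = [to_record(page) for page in PAGES]
for record in records:
    print(record["name"] + ": " + ", ".join(record["ingredients"]))
```

Because the code looks only at the class names and ignores the surrounding tags, both recipes end up in the same shape, which is the consistent form I am describing above.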

Imagine then how this could impact publishers whose information could be disaggregated. This could include every topic from travel to technology to cooking to sewing and knitting. A single dress pattern (Mrs PND believes no one sews anymore but no matter), which typically existed with many others in a book (or magazine), can now be extracted, indexed and monetized. Just think how many discrete elements could exist at a typical publishing house if they were disaggregated from their 'mother' products.

While the benefits of the Yahoo initiatives can be debated, at their core is a potential transformation in the manner in which information is produced and disseminated. The technology is not new, but the weight of Yahoo could propel its adoption and that would be a good thing. While publishers will be slow and cautious (some would say cumbersome), they should become the biggest beneficiaries of this initiative: they own massive quantities of content that has traditionally been packaged in ways that discourage narrow use. Microformats and the resulting syndication models will open up those content repositories and fundamentally change publishing.

Perhaps I could also add that these changes may enable publishers to reestablish stronger influence over the distribution of their content, influence that has been lost in the physical world to Amazon and B&N. Maybe and perhaps...

1 comment:

Eric said: It's worth noting that the whole Yahoo effort using microformats got eliminated when Yahoo dropped their internal search in favor of Bing. But Bing incorporated "semantic" technology from the acquisition of Powerset, while Google acquired structured data aggregator Freebase. And Apple acquired Siri.

So lots has been happening in this field. (8:09 AM)