The Wisdom of Crowds Meets the Wisdom of Authors: How XML Enables the Semantic Web
By my colleague Paul Wlodarczyk, VP of Solutions Consulting for JustSystems:
I recently attended the first-ever Linked Data Planet conference, where a number of pioneers in the field of the Semantic Web shared their perspectives on the state of the art, and the business, of helping the world tag its web pages for meaning. So what is the Semantic Web, and how is it different from the web of today? On the web, most search engines today use keywords and the number of links to a page to determine the relevance of search results. This is the wisdom of crowds at work: if the keywords you are searching for occur often on a page, and the page is popular (i.e., lots of other pages link to it), then it is probably the best bet for what you are searching for. The downside of this approach is that it can only infer the meaning of the page from those signals; nothing on the page states it. On the Semantic Web, the crowds get wiser thanks to the wisdom of authors, who can let the crowds know, in no uncertain terms, what their content means.
For example, when “New York” appears in an HTML document, it could mean New York City, New York State, the Yankees, the Mets, the Giants, the Jets, the song, the strip steak, the state of mind, etc. You get the idea. Words are ambiguous when taken out of context.
If I’m writing about a sporting event, the context of the article lets you know that “New York” means a specific team. The typical search engine, however, doesn’t recognize context. To a search engine, “New York” is just a string that occurs in the document with some frequency.
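To make the point concrete, here is a minimal sketch of how a purely keyword-based matcher sees the phrase. The two sample sentences and the function name phrase_frequency are made up for illustration; real search engines are far more sophisticated, but the ambiguity problem has the same shape.

```python
import re


def phrase_frequency(text: str, phrase: str) -> int:
    """Count case-insensitive occurrences of a phrase, as a naive keyword matcher would."""
    return len(re.findall(re.escape(phrase), text, flags=re.IGNORECASE))


# Two invented documents that use the same string with different meanings.
sports = "New York beat Oakland 5-3. New York now leads the series."
food = "The New York strip is a classic steak. Order the New York medium rare."

for doc in (sports, food):
    # Both documents produce the same count, so the count alone cannot
    # tell a baseball story from a steakhouse menu.
    print(phrase_frequency(doc, "New York"))
```

Both documents score identically for "New York", which is exactly why frequency by itself says nothing about meaning.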
Key to the Semantic Web is semantic markup, which lets users annotate their web pages with metadata — HTML attributes that don’t get displayed in the document. Semantic metadata describes what the pages are about, letting authors define things with authority and precision.
In my “New York” document, I can state that the document is about the sports team, not the steak. I can do this by tagging the named entities in the document — the people, places, things, events, and facts — in an unambiguous way. I can also set those entities into relationships with each other. If part of my document refers to a player trade between the New York Yankees and Oakland A’s, I can tag the Yankees (entity number one), the A’s (entity number two), and the player trade (an event, but also a relationship between the two named entities).
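One way to picture this kind of tagging is as a small data model: entities with unambiguous identifiers, annotations that tie spans of text to those entities, and relationships between the entities. The sketch below is only an illustration; the class names, the identifiers such as team:nyy, and the sample sentence are all hypothetical and do not come from any particular Semantic Web vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str      # hypothetical unambiguous identifier
    kind: str    # e.g. "SportsTeam"
    label: str   # human-readable name


@dataclass(frozen=True)
class Relation:
    kind: str        # e.g. "PlayerTrade", an event linking two entities
    subject: Entity
    object: Entity


def span_of(text: str, surface: str) -> tuple[int, int]:
    """Return the (start, end) offsets of the first occurrence of surface in text."""
    start = text.index(surface)
    return start, start + len(surface)


yankees = Entity("team:nyy", "SportsTeam", "New York Yankees")
athletics = Entity("team:oak", "SportsTeam", "Oakland A's")
trade = Relation("PlayerTrade", yankees, athletics)

text = "New York traded a player to Oakland."

# Each annotation ties an ambiguous span of text to one specific entity.
annotations = {
    span_of(text, "New York"): yankees,
    span_of(text, "Oakland"): athletics,
}

for (start, end), entity in annotations.items():
    print(f"{text[start:end]!r} -> {entity.label} ({entity.id})")
print(f"{trade.kind}: {trade.subject.label} <-> {trade.object.label}")
```

With the annotation in place, "New York" in this sentence can only mean the team, and the trade is recorded as a relationship rather than left for a reader to infer.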
Overcoming the semantic hurdles. While semantic tagging gives documents unambiguous meaning, it has traditionally faced two large hurdles. First, adding semantic markup has been relatively expensive, in terms of either labor or technology. Second, the market for consuming this markup has been small. Both of those hurdles are rapidly falling away.
Let’s address the second point first. Yahoo! has introduced SearchMonkey, a new technology that rates web pages. Rather than using keywords and the number of links to the page (the wisdom of crowds), SearchMonkey finds web pages using the semantic markup that is embedded in the page (the wisdom of authors). This creates a substantial motive for adding semantic markup: search engine optimization. Semantic markup makes your content more likely to be found and more relevant to the searcher.
Marking up existing content. Which brings us to the first point: How do you add semantic markup? For legacy content, you need to use some combination of people and automation. Using people to tag existing content requires specialized skills that are in short supply. But some interesting technologies for auto-tagging content are emerging. Thomson Reuters’ Calais is a great example. For a demo, visit http://sws.clearforest.com/calaisviewer/, and try pasting in some text that describes your company. I did, and Calais accurately tagged all of the named companies, legal entities, products, technologies, countries, and cities, and correctly identified a product acquisition as an event.
Going forward, I recommend adding semantic markup to new content as it is created. We’ve been doing this with XML and SGML documents for years, using semantic tags to unambiguously flag specific pieces of text for future discovery. Tagging part numbers in a service manual, for example, can automate the addition of hyperlinks and improve search relevance. Tagging an annual report with XBRL can automate the discovery of specific facts within the management discussion and analysis or the footnotes, helping to prevent another Enron.
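The part-number case is easy to sketch. The example below assumes a made-up service-manual fragment in which part numbers are wrapped in a partnum element, plus a hypothetical catalog URL pattern; neither comes from a real schema or product. It uses Python's standard xml.etree.ElementTree module to turn every tagged part number into a hyperlink.

```python
import xml.etree.ElementTree as ET

# Invented fragment: <procedure>, <step>, and <partnum> are illustrative names.
MANUAL = """<procedure>
  <step>Remove the filter <partnum>FX-1042</partnum> and inspect the seal <partnum>SL-0007</partnum>.</step>
</procedure>"""

# Hypothetical catalog address; substitute whatever your parts system uses.
CATALOG_URL = "https://parts.example.com/catalog/{}"


def link_part_numbers(xml_text: str) -> str:
    """Rewrite every <partnum> element as an <a href=...> link to the catalog."""
    root = ET.fromstring(xml_text)
    # Collect the matches first so that renaming tags cannot disturb iteration.
    for part in list(root.iter("partnum")):
        part.set("href", CATALOG_URL.format(part.text))
        part.tag = "a"
    return ET.tostring(root, encoding="unicode")


print(link_part_numbers(MANUAL))
```

Because the author already marked what each string is, the linking step needs no guessing about which strings in the text are part numbers.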
Semantic tagging and content reuse. More recent XML standards, like DITA, help authors focus on creating granular, component-based content, primarily for content reuse. In addition to providing definitive metadata, DITA encourages organizations to break their content out of the document model. Storing and retrieving content in documents makes the facts and events inside those documents harder to discover across the enterprise.
Think of a lengthy policies and procedures manual. Historically, individual policies and procedures have been bound together in one book for the convenience of publishing. Adding or revising one policy means the whole manual must be reprinted and distributed. Today, electronic publishing on the web, intranets, and portals makes it possible to publish each policy or procedure individually, as it is added or revised. The book itself is obsolete.
A very large document like a policy manual used to be managed in a document management system as a single, monolithic document. Now, it can be managed in a content management system as a collection of hundreds of reusable DITA topics. And how do you effectively manage large quantities of DITA topics? By specifying metadata for each topic, so you can find it again, just as semantic markup does for pages on the Semantic Web. The same technologies that evolved for adding semantic markup to web pages can be used to add it to DITA. DITA is already in widespread use as a way of authoring content that can be rendered to web pages, so the semantic markup can be inserted as part of the publishing process, with no special skills required on the part of the author.
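Here is a rough sketch of finding topics by their metadata. It assumes each topic carries keywords in its prolog (the prolog, metadata, keywords, and keyword elements are standard DITA), while the topic ids, titles, and keyword values are invented for the example. A real content management system would index far more than this, but the principle is the same.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Three invented policy topics, each tagged with keywords in its prolog.
TOPICS = [
    """<topic id="travel-policy"><title>Travel Expenses</title>
         <prolog><metadata><keywords>
           <keyword>expenses</keyword><keyword>travel</keyword>
         </keywords></metadata></prolog>
         <body><p>Submit receipts within 30 days.</p></body></topic>""",
    """<topic id="purchasing-policy"><title>Purchasing</title>
         <prolog><metadata><keywords>
           <keyword>expenses</keyword><keyword>approval</keyword>
         </keywords></metadata></prolog>
         <body><p>Orders over the limit need sign-off.</p></body></topic>""",
    """<topic id="badge-procedure"><title>Replacing a Badge</title>
         <prolog><metadata><keywords>
           <keyword>security</keyword>
         </keywords></metadata></prolog>
         <body><p>Report lost badges immediately.</p></body></topic>""",
]


def build_keyword_index(topics: list[str]) -> dict[str, set[str]]:
    """Map each normalized keyword to the ids of the topics tagged with it."""
    index: dict[str, set[str]] = defaultdict(set)
    for xml_text in topics:
        root = ET.fromstring(xml_text)
        for kw in root.findall("./prolog/metadata/keywords/keyword"):
            index[kw.text.strip().lower()].add(root.get("id"))
    return index


index = build_keyword_index(TOPICS)
print(sorted(index["expenses"]))
```

Looking up "expenses" returns both policy topics without searching their body text, because the authors stated up front what each topic is about.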
Clearly, combining semantic markup with a granular authoring approach like DITA holds a lot of promise for content creators and consumers alike. Content becomes easy to define and even easier to discover. The combination also holds a lot of promise for the future of the Semantic Web itself. In fact, creating the Semantic Web might be as easy as authoring content in DITA.
---
Paul Wlodarczyk is vice president of solutions consulting for JustSystems, the largest ISV in Japan and a worldwide leader in XML and information management technologies. Learn more about JustSystems at http://www.justsystems.com, and contact Paul at paul.wlodarczyk@justsystems.com.