Content Management Meta-Data Stratigraphy, Topic-based Authoring, and DITA

A clever friend of mine suggested that I write some blog posts about DITA, or, rather, my experiences putting a DITA Content Management System into production use, since there's already a fair bit out there about DITA as an idea. I've been doing that on my personal blog, but it occurred to me, in the typical fashion of sudden rushes of intelligence to the head, that putting those posts up here would get them to a wider audience.

I'm going to assume you know what DITA is (the Darwin Information Typing Architecture, an XML vocabulary for technical authoring) and why it might be interesting (specialization by modified descent of existing elements to support more precise semantic meanings, support for topic-based and scenario-based authoring). One has to start somewhere, and the OASIS and related sites have what-is-this? explanations.

Why Stratigraphy?


I'm going to start with content management meta-data stratigraphy, which is just the layered sequence of how much you need to know about something (call it a content object, so we can avoid getting tangled up in the question of what, exactly, counts as a document or a file) without having to read the content or ask a human being. The higher layers build on the information in the lower layers, and order is not merely important but crucial; hence, stratigraphy. The kind of authoring you can do with DITA is fundamentally set by the type and amount of meta-data your content management system can store in association with a content object, and this is the first decision you get to make about using DITA to meet your information delivery needs.

Basic Version Control: Who Did What, When?


At the very bottom, and the best-recognized level (because this is what software projects do), there's basic version control. Assuming you have access to the content management system, you can count how many times a content object has been checked in (or, depending on terminology, "committed") to the server, so you can tell version four from version nine, and (probably) who checked in each of them. Depending on the version control system, you may be able to compare the contents of version four and version nine to see what's changed, but pretty much all version control systems were designed to manage software, and have a very strong line-wise bias. Line-wise comparison isn't much good for XML content, where line breaks are not significant and you want to make comparisons based on the XML elements.
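As a made-up illustration, these two fragments carry the same element content, but once an editor re-wraps the text a line-wise diff flags every line as changed:

```xml
<!-- version 4: paragraph serialized on one line -->
<p>Connect the test harness to the unit under test before applying power.</p>

<!-- version 9: same element content, re-wrapped by the editor; a
     line-wise diff reports the whole paragraph as changed, while an
     XML-aware comparison (which normalizes insignificant whitespace)
     sees no difference that matters -->
<p>Connect the test harness to the
unit under test before applying
power.</p>
```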


Single Sourcing: One Content Object, Many Uses


Built on top of version control you get to single-sourcing, where there is only one copy of any specific content object and everywhere that content is used references the single copy. It's possible to do basic single-sourcing with nothing but version control. This is the same as assuming that the version you want to reference when building a content delivery (staying away from "document" again; maybe the content delivery is one large PDF, or maybe it's a large number of HTML files) is always the most current version of the content object. This can be made to work, but it has several drawbacks:
  • you don't know which content delivery might be referencing this content object, so you can't decide how safe it is to change without asking everyone who might be using it. At best that's slow; at worst it's impossible. You might guess based on who checked it into the version control system last, but just because the last change was four months ago you can't safely assume it's not being used! Many standard text elements (legal disclaimers, safety warnings, etc.) change only rarely.
  • you don't know anything about the completeness of the content. So you can't re-use content without finding the person who checked that content object in last and asking them "is this done yet?" (Potentially, "will this be done in time for me to use it in my content delivery which has thus-and-such a deadline?")
  • there is no way to reference a specific version of a content object, in those cases where you are sure you don't want to reference the current version. This means you can't do things like maintain a copy of the as-delivered content delivery within the content management system, because you have no way to say "I mean version 5, not whatever changed version of that content object we've reached a year later" in your references.
For single sourcing, in addition to knowing who checked the content object in (and when), and what revision number this is, you need to be able to determine from the content object meta-data:
  • which objects reference this content object, which answers "can I delete this content object?" and "can I safely change this content object?"
  • the "information quality" (complete, draft, reviewed, etc.) of the content object, which tells you if you can ship this content object in your content delivery. (You don't want to ship an abandoned, two quarter old, unchecked first draft, and you should not have to hunt someone down and ask them to find out if that's what this content object is.)

You also need a reference system that can reference specific versions of a particular content object, not just the most current version. So far, you can do version-controlled and single-sourced content delivery with just about anything that supports importing content by reference; you do not have to use DITA specifically or XML generally. Support for more than one layer of references is best (this rules out effectively all desktop publishing programs), but single sourcing can be done with almost any tool that supports import-by-reference: FrameMaker, Word, OpenOffice, LaTeX, troff, plain-text markup like reStructuredText or Markdown, text files pulled into a layout program like InDesign or Scribus, and so on.
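In DITA specifically, the import-by-reference mechanism is the content reference (conref) attribute. A minimal sketch, with made-up file and id names:

```xml
<!-- shared/warnings.dita holds the single source of a standard warning -->
<note id="high-voltage" type="danger">Disconnect mains power before
opening the enclosure.</note>

<!-- any topic that needs the warning pulls it in by reference,
     using the file#topicid/elementid addressing form -->
<note conref="shared/warnings.dita#warnings/high-voltage"/>
```

Note that pinning such a reference to a specific version, rather than to whatever is current, is a content management system feature; DITA itself only names the target.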


Form/Content Separation: The Container is Not the Content


The thing you cannot do without XML [1] is form/content separation, where the eventual visual (or other) presentation of the content is entirely independent of the content itself. Form/content separation requires semantic tagging (labelling content by function, rather than by appearance) in the content, and semantic tagging requires XML. (Or something else with the properties of XML markup, but no one has built that, and doing it yourself from scratch would be a truly ridiculous amount of work.) It also requires a mechanism to process the semantically-tagged content into one or more eventual delivered formats, so that the XML <title> element ("<title>Introduction</title>") becomes, for example, <h1>Introduction</h1> in HTML output and sixteen-point, bold, sans-serif text with 20 points of spacing after it in PDF output.[2] Form/content separation isn't its own layer in the stratigraphy; it doesn't impose any additional requirements for content object meta-data over and above those required for single-sourcing. It does require that you have output processing able to take your XML and convert it into whatever delivered format you use. (Output processing as a subject will eventually get a bunch of these posts; I'm not going to try to cover it in a couple of sentences!) Form/content separation is a requirement for effective topic-based authoring; until you have form/content separation, you can't support the key property of topics, which is that they can be arbitrarily re-ordered without changes to their content.
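A minimal sketch of the HTML side of that processing, as a single XSLT template (the real DITA Open Toolkit stylesheets are far more involved than this):

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- the semantic <title> element becomes a presentational heading;
       a parallel set of templates would map the same element to
       XSL-FO for the PDF output path -->
  <xsl:template match="title">
    <h1><xsl:apply-templates/></h1>
  </xsl:template>
</xsl:stylesheet>
```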


Topic-Based Authoring: Topics are Content; Organization is Another Kind of Object


The essential thing about topics is that you can re-arrange them without reference to their content. So it does not matter if a topic is the first referenced content, or is deeply nested halfway through the content delivery; the topic will process properly without there being any need to change its content. (Compare to the need to manually adjust styles, or to have duplicate content objects with different styles, when using single sourcing with content objects created using a DTP program.)

Put another way, this means that topics are strictly content; they don't contain information about the structure that organizes the topics into a content delivery. This means you need a new kind of content object that consists of an organization of topics. In DITA, these are called maps. To support topic-based authoring, your content management system needs to include maps, and include the same meta-data about maps that it does about topics. You may also want to associate extra meta-data with maps to support things like variables (for instance, if the "Product Name" variable is associated with the map, you can re-use the same topic with multiple product names by using it in multiple maps, something that can be very helpful when providing content deliveries for similar products), delivery-specific information like company name or copyright year, or references to content that should be associated with this map when it's processed, such as cover art. [3]

DITA as an XML vocabulary is intended to support topic-based authoring. Using topic-based authoring nets you most of the initial productivity gains from DITA, and you should be planning to get from wherever your writing process is to topic-based authoring if you're adopting or planning to adopt DITA as your writing mechanism. (Specifics about productivity gains will be another post; the very short version is that you should expect per-writer productivity to at least double.)
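A minimal sketch of a DITA map, with made-up topic and product names; the map carries structure and delivery-specific meta-data, while the topics it references carry only content:

```xml
<map>
  <title>Widget 2000 Installation Guide</title>
  <topicmeta>
    <!-- hypothetical product-name variable: a second map for a
         similar product can re-use the same topics with a
         different value here -->
    <othermeta name="product-name" content="Widget 2000"/>
  </topicmeta>
  <!-- nesting of topicrefs, not anything inside the topics,
       determines the structure of the content delivery -->
  <topicref href="overview.dita">
    <topicref href="unpacking.dita"/>
    <topicref href="installing.dita"/>
  </topicref>
</map>
```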
Once you have topic-based authoring, you've got both form/content and content/structure separation. The production of information content and the arrangement of that content into a content delivery are distinct jobs (although the same person might do both of them), and this opens up the possibility of scenario-based authoring.


Scenario-based Authoring: Tell Me What You Need to Know


Scenario-based authoring makes the starting assumption that you want to deliver information based on known customer needs. You may or may not continue to document everything about your product, but you'll focus your attention on the work your customers are most concerned to perform using your product. This is done by analysing the work in terms of roles (who does it) and goals (what the work is). Roles and goals are made specific and concrete in terms of a persona, a fictional person who performs the role, and a scenario, a specific instance of meeting the goal. So if the role is "engineer" and the goal is "test the product", the persona would give a specific engineer, with a name, a specialization, background, degree of experience, and so on, and the scenario would involve a specific product version, specific test equipment, and so on.

In order to do this, you need to be able both to store personae and scenarios and to associate them with specific DITA maps. This allows the map to be created so that it fulfils the scenario or scenarios associated with it. This in turn allows you to negotiate with your customers about what they most need to know, hopefully permitting you to cause your customers to see the documentation you deliver as beneficial or essential to their ability to meet their own business objectives. While such negotiation is possible with any kind of documentation production technology, including quill pens and parchment, the speed and flexibility of topic-based authoring is what makes it possible to respond to the differing needs of diverse customers inside the time available in a modern rapid product life-cycle at an acceptable effort cost.

DITA does not have formal support for associating personae and scenarios, but it's flexible enough that this is not a problem in practice, provided that the content management system allows treatment of your stored personae and scenarios as some form of meta-data, preferably (personae get large!)
as meta-data by reference.

So, that's the meta-data stratigraphy your content management system will need to deal with in going from support of basic version control through single-sourcing to topic-based and then scenario-based authoring. If the content management system you're using does not support the minimum amount of information at each level of the stratigraphy, you're not going to be able to move successfully to that level in your process. The reason to adopt DITA is to get to topic-based authoring; make sure the content management system you're adopting along with DITA will support enough meta-data to get you to the topic-based authoring level.
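One way to sketch the persona/scenario association described above is with DITA's generic <data> element inside the map's meta-data; the names and files here are a hypothetical local convention, not anything DITA defines:

```xml
<map>
  <title>Acceptance Testing Guide</title>
  <topicmeta>
    <!-- meta-data by reference: point at stored persona and
         scenario objects rather than embedding them in the map -->
    <data name="persona" href="personae/test-engineer.dita"/>
    <data name="scenario" href="scenarios/regression-test.dita"/>
  </topicmeta>
  <topicref href="test-setup.dita"/>
</map>
```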


Change Budget: A Whole 'Nother Problem


These things must be done in order. Each stage in the stratigraphic progression is a substantial change away from document-focused, narrative authoring, and about as much change as a writing team will be able to absorb during a single product life-cycle while continuing to meet deadlines. Put another way, moving up one layer in the content-management stratigraphy will use up the whole change budget of each of the people involved for that product cycle. Each change of layer is also a step toward becoming reliant on the search capabilities of your CMS to be able to find and reference your content, and it's critical that your content management system support effective (and XML aware) searching of content object data and meta-data as you move through this progression. The gain is greatly increased productivity (you should expect productivity to increase by better than a factor of two), consistency, and repeatability in information delivery.

[1] SGML will work, too, but XML is a proper subset of SGML and exists primarily so no one has to deal with the full and complex glory that is SGML.

[2] PDF output involves converting XML to something called XSL-FO, "eXtensible Stylesheet Language Formatting Objects", and then feeding the formatting objects through specialized formatter software to get PDF output.
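For the title example above, the intermediate formatting object might look something like this (a sketch, not any particular toolchain's actual output):

```xml
<fo:block xmlns:fo="http://www.w3.org/1999/XSL/Format"
          font-size="16pt" font-weight="bold"
          font-family="sans-serif" space-after="20pt">Introduction</fo:block>
```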

[3] I'm leaving out images for now, but they make up another content type.