Daylights: Ya need an adapter!


One of the joys I get from working with the DITA Open Toolkit is seeing the extent to which it helps run the production business of so many writing teams.  The DITA OT team has done a lot of work on its major interfaces to let it run in more places, with more options and fewer restrictions, and to make it both embeddable and pluggable, greatly increasing its versatility for individual users and commercial authoring tools alike.

Shortly I'll be starting a Wiki page here for the DITA community to collaborate on the major features and functions that will go into DITA OT 1.4, the next major release.  What new transforms should be considered? What new utilities or specializations would bring value for us? I am looking forward to getting some early and clear input on what the growing team of OT developers will be working on.

One feature the Toolkit does not yet provide is any sort of interface to storage systems other than straight file systems.  We'll never lose that original premise of the Toolkit--that DITA's reuse architecture works just fine in a file-based environment, meaning that you can get even phrase-level reuse without a content management system.  It is so important that demos and lightweight applications can be done with tools that are free or of very low cost.  DITA should be accessible, and I believe that the full-package installation has, for the first time, enabled novice users to do productive things within minutes instead of hours.  But there is still room for strategic improvements, one of which is the need to move beyond the file system.

For DITA OT 1.4, I would like to entertain the thought of defining a basic CMS adapter that is provided as part of the Toolkit.  The idea is that this interface would represent the core services that all CMS providers could ensure work exactly the same way across different vendors' systems.  Anything beyond that set of services is part of the value-add upon which these systems compete, but the common core ensures that you can shop for a CMS like you would shop for lenses for a camera, knowing that any brand of lens with your camera's mount will work with your camera.  Because this adapter would be part of the Toolkit, any editor or Help Authoring Tool that embeds the Toolkit would be immediately adaptable to any conforming CMS. Consultants, does this appeal to you?
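
To make the idea concrete, here is a rough sketch of what such a core-services contract might look like. Every name below is invented for illustration; nothing like this exists in the Toolkit today, and the actual set of services is exactly what I'd like the community to help define.

    // Hypothetical sketch only: none of these names exist in the DITA Open Toolkit.
    // The point is a small core contract that every conforming CMS implements
    // identically, leaving richer services as vendor value-add.
    import java.io.InputStream;
    import java.util.List;

    public interface DitaCmsAdapter {

        /** Open a session against a repository; credential handling is vendor-specific. */
        void connect(String repositoryUri, String user, char[] password) throws CmsException;

        /** Read a topic, map, or image by a repository-neutral identifier. */
        InputStream read(String resourceId) throws CmsException;

        /** Write or update a resource; the CMS decides how it versions the change. */
        void write(String resourceId, InputStream content) throws CmsException;

        /** List the resources a map references, so the Toolkit can resolve dependencies. */
        List<String> listDependencies(String mapId) throws CmsException;

        /** Lock and unlock for concurrent authoring, as raised later in this thread. */
        boolean lock(String resourceId) throws CmsException;
        void unlock(String resourceId) throws CmsException;

        void disconnect();
    }

    /** Placeholder checked exception for adapter failures. */
    class CmsException extends Exception {
        public CmsException(String message, Throwable cause) { super(message, cause); }
    }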

In reading about CMSs (about which I actually know very little), I'm aware that this is not a new idea. I recall the Astoria adapters for FrameMaker and Author/Editor way back in the SGML days.  What IS new is that there are now several standards to pick from, such as Open Service Interface Definitions and others. I don't think that any invention is needed for this requirement, just some collaboration and agreement on which of several very good standards to use, and what the core definitions will be. The rest, as they say, is a Simple Matter Of Programming.  Right.

What are your thoughts? I would like to hear from users who have had good/bad experiences with CMS or file management integrations, IT folk who have had to code their own adapters, vendors, consultants, anyone who can help us decide whether to include this interface with the next Toolkit.

-- Don Day

While we're talking about possible new transforms, utilities, etc., one problem we've hit (especially with our customers) is that the graphic file formats supported by the Open Toolkit are pretty limited. This is especially true for folks that output both on-line deliverables and PDF for print. JPG and GIF files are wonderful for on-line rendering and are the standard for things like HTML compiled help. However, JPG and GIF -- especially those saved at 72 dpi or 100 dpi -- are unacceptable for PDF rendering when that PDF file is slated to actually be printed. Raster file formats in general are not the best for print production: they do not scale well, and their usual low resolution makes the print output unacceptable for many applications. Print output usually uses a vector format -- with SVG and WMF preferred, as most PDF rendering engines can handle these without the additional step of running Acrobat Distiller (which is required if using EPS files).

It would be great if the Open Toolkit would accept vector graphic formats as input and convert these to an appropriate raster format when the output is "on-line".

Without this facility, reusing topics for multiple deliverables requires managing two distinct graphic files for one illustration (one raster for on-line, one vector for print). Or, for folks not inclined to that method, no reuse at all, with distinct topics for print and for on-line. Or you could add extra "conditional" processing, with two distinct image tags whose select attribute values depend on the delivery target. Any of these options adds another level of management and complexity to the overall DITA environment. Automatic conversion of vector to raster would eliminate this added layer of complexity and allow a single source for all graphic files, with the appropriate format used for each deliverable.
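
To make the suggestion concrete, here is one possible sketch of the conversion step, assuming Apache Batik (which the Toolkit does not currently bundle) were available on the classpath; the paths and target width are only illustrative.

    // Sketch of automatic vector-to-raster conversion for on-line deliverables,
    // assuming Apache Batik is on the classpath. Paths and width are illustrative.
    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    import org.apache.batik.transcoder.TranscoderInput;
    import org.apache.batik.transcoder.TranscoderOutput;
    import org.apache.batik.transcoder.image.PNGTranscoder;

    public class SvgToPng {
        public static void main(String[] args) throws Exception {
            String svgPath = args.length > 0 ? args[0] : "images/diagram.svg";
            String pngPath = args.length > 1 ? args[1] : "out/html/images/diagram.png";

            PNGTranscoder transcoder = new PNGTranscoder();
            // Target a screen-friendly width; the print build would keep the SVG as-is.
            transcoder.addTranscodingHint(PNGTranscoder.KEY_WIDTH, new Float(800));

            FileInputStream in = new FileInputStream(svgPath);
            FileOutputStream out = new FileOutputStream(pngPath);
            try {
                transcoder.transcode(new TranscoderInput(in), new TranscoderOutput(out));
            } finally {
                in.close();
                out.close();
            }
        }
    }

Wired in as a pre-processing step for the on-line targets, something like this would let a map reference the vector original once, with the raster produced only where it is needed.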

Which is a large can of worms.

Lots of things called a "Content Management System" are intended purely to support dynamic content; they have no native ability to produce fixed documents (PDF, print, etc.).

Some things called a CMS are document management systems with limited support for authoring; some other things are intended to support authoring with true multi-channel publishing, to both fixed and dynamic output types.

I don't think it's wrong to call any of those things a CMS, but they're very different applications, and DITA is general enough to support all of them. (I'm fairly sure there's now at least one known case study for each of those types of application.)

I'd say there are three critical issues, all around some kind of process control:

  1. metadata
  2. resource locations
  3. output generation

In a good object model, metadata ought to live in the object, which in the DITA case implies the map or the topic. However, as a CMS maintainer you don't necessarily want to make the metadata writable by all the CMS users (consider hand-edited topic IDs in a CMS that has automation to guarantee topic ID uniqueness, for a simple case); this implies that we ought to look at some sort of metadata container visibility selection. (All, some, none?)
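
As a purely illustrative sketch (the names are invented, not a proposal for the spec), the all-some-none idea might boil down to a small policy object the CMS consults before letting an author touch a metadata field:

    // Illustrative only: a possible all/some/none writability policy for metadata
    // containers managed by the CMS rather than by individual authors.
    import java.util.Set;

    public class MetadataPolicy {

        public enum Visibility { ALL, SOME, NONE }

        private final Visibility visibility;
        private final Set<String> writableFields; // consulted only when visibility == SOME

        public MetadataPolicy(Visibility visibility, Set<String> writableFields) {
            this.visibility = visibility;
            this.writableFields = writableFields;
        }

        /** True if an author may edit the given metadata field (e.g. "id", "prodinfo"). */
        public boolean isWritable(String field) {
            switch (visibility) {
                case ALL:  return true;
                case NONE: return false;
                default:   return writableFields.contains(field);
            }
        }
    }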

Resource locations are a pain if there's a presumed file system, because you get issues around things like finding image files; right now, this runs into constraints from the type suffix (screenshot.bmp and screenshot.png are inherently different and break all your links, even if they're two versions of exactly the same item on the screen), plus all the hardcoded locations in the Toolkit processing chain. It would be much better to have some kind of abstract resource reference available, one in which you can combine a type and a name without directly referencing a file. (Because it might not be a file; it might be a DB retrieve call of some kind.)
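
Again purely as a sketch with invented names: an abstract reference could pair a logical type with a name and leave retrieval to a pluggable resolver, so the suffix never leaks into the links.

    // Invented names for illustration: an abstract resource reference pairing a
    // logical type with a name, leaving it to a pluggable resolver to decide whether
    // the bytes come from a file, a database retrieve, or a CMS call.
    import java.io.IOException;
    import java.io.InputStream;

    public final class ResourceRef {
        private final String type; // e.g. "image", "topic", "map"
        private final String name; // e.g. "screenshot" -- no .bmp/.png suffix baked in

        public ResourceRef(String type, String name) {
            this.type = type;
            this.name = name;
        }

        public String getType() { return type; }
        public String getName() { return name; }
    }

    interface ResourceResolver {
        /** Return the content for a reference, however it happens to be stored. */
        InputStream open(ResourceRef ref) throws IOException;
    }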

Output generation will get done by some sort of scripting, whether Ant or something else entirely; I don't think that's a CMS issue.

I do think there's an issue around output readiness -- is this DITA content object in a fit state to be output as final, ship-to-customer output? Whose approvals has it got? Do we want to hold up print because we haven't got the approvals for web?

That sort of thing is appropriate for a CMS to keep track of; it might not (I would tend to think not) be appropriate for DITA to keep track of, but if so it might be best to assert that explicitly. (Especially since DITA has attributes which can be taken as referencing content quality.)
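
Purely as an illustration of the bookkeeping involved (the names are invented), the CMS-side record might be no more than approvals tracked per output channel:

    // Names invented: the sort of per-channel approval bookkeeping a CMS (not DITA
    // itself) might keep, so a build can ask "is this object fit to ship as PDF?"
    import java.util.HashMap;
    import java.util.Map;

    class OutputReadiness {
        private final Map<String, Boolean> approvedByChannel = new HashMap<String, Boolean>();

        void approve(String channel) { approvedByChannel.put(channel, Boolean.TRUE); }

        /** e.g. readyFor("print") can be true while readyFor("web") is still false. */
        boolean readyFor(String channel) {
            return Boolean.TRUE.equals(approvedByChannel.get(channel));
        }
    }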

The things we had serious problems with in our own CMS implementation were the resource locator issues.

A related discussion has just been started on the Yahoo! Groups framemaker-dita forum:
"CMSs with concurrent checkouts, branching, merging; also XML" (http://tech.groups.yahoo.com/group/framemaker-dita/message/1027), cross-posted to dita-users (http://tech.groups.yahoo.com/group/dita-users/message/4117). Both are worth watching.  What this impressed on me is that we can define the expected services of an adapter in terms of the functions needed (in Hedley's case, concurrency and delta tracking). Since these kinds of functions are provided for by WebDAV and perhaps other standards, I'm thinking they should help define the common services that a Toolkit interface would provide to whatever content manager a user chooses for their overall system.  Does it make sense to specify a common interface first in terms of common user-level services and expectations?

--
Don Day
Chair, OASIS DITA Technical Committee
Project Lead, DITA Open Toolkit
IBM Lead DITA Architect

From my experience, one of the biggest limitations of the DITA OT is its dependency on file system resources. As already mentioned, this can be solved with some kind of abstraction for resource access, ideally based on existing standards and implementations.

JSR 170 might be a sufficient standard for this purpose; the DITA OT would only need a subset of the feature set defined in the specification. The drawback: it is Java-only, and resource resolution must also be available within the XSLT processor, which requires modification or a tricky implementation for each standard non-Java task within the transformation pipe.
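
As a rough, untested sketch of that "tricky implementation" point: a javax.xml.transform.URIResolver backed by a JSR 170 session is one way to make repository content visible to the XSLT steps. The assumption here is that hrefs are absolute repository paths to nt:file nodes, which is a simplification.

    // Rough sketch, not tied to any specific repository product: resolve XSLT
    // document()/xsl:include URIs against a JSR 170 (javax.jcr) session so that
    // stylesheets can read topics straight out of the repository.
    import java.io.InputStream;

    import javax.jcr.Node;
    import javax.jcr.Session;
    import javax.xml.transform.Source;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.URIResolver;
    import javax.xml.transform.stream.StreamSource;

    public class JcrUriResolver implements URIResolver {

        private final Session session; // obtained elsewhere via Repository.login(...)

        public JcrUriResolver(Session session) {
            this.session = session;
        }

        public Source resolve(String href, String base) throws TransformerException {
            try {
                // Assumes hrefs are absolute repository paths to nt:file nodes,
                // e.g. "/docs/concepts/overview.dita" -- a simplification.
                Node file = (Node) session.getItem(href);
                InputStream data = file.getNode("jcr:content")
                                       .getProperty("jcr:data")
                                       .getStream();
                return new StreamSource(data, href);
            } catch (Exception e) {
                throw new TransformerException("Cannot resolve " + href, e);
            }
        }
    }

The resolver would have to be registered on the transformer factory before the Toolkit's XSLT steps run, and every non-Java task in the pipe would still need its own equivalent.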

Using WebDAV might be another option that would be sufficient to abstract access to CMS resources. It only requires enabling the DITA OT to deal with HTTP resources, and most CMS vendors today provide HTTP-based access to the resources stored in their repositories. This method is also transparent to XSLT and can therefore be used without any modification to the transformation engine in place.
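
As a minimal illustration (the URLs are invented), an HTTP-hosted stylesheet and topic can already be handed to a JAXP transformer as-is, with no custom resolver:

    // Minimal illustration with invented URLs: because WebDAV reads are plain
    // HTTP GETs, an http-hosted topic can be handed to a JAXP transformer as-is;
    // relative references inside it resolve against the same http base.
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class HttpSourceDemo {
        public static void main(String[] args) throws Exception {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer(
                    new StreamSource("http://cms.example.com/dav/stylesheets/identity.xsl"));

            transformer.transform(
                    new StreamSource("http://cms.example.com/dav/topics/overview.dita"),
                    new StreamResult(System.out));
        }
    }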

Besides providing a resource abstraction to the DITA OT, there is another option that might be considered: adopting an XML pipeline approach, in other words moving from a resource- and dependency-driven approach to an XML infoset streaming approach.

This kind of design is much more scalable and extensible in my view, because each step in the transformation pipe operates on the XML infoset instead of on a resource that needs to be interpreted at each step of the pipe. I'm not focused on performance issues here, but on functional scalability.

In this scenario, resource access is only required of, and provided by, one dedicated processor within the pipe; the rest of the pipe stays unchanged, because it is based on the resulting stream of XML infoset.

A given vendor only has to provide an implementation of a processor that follows a public interface, or the platform in use may already provide resource processors for whatever a CMS supports (WebDAV, web services, and so on). It also enables the DITA OT to use additional XML-based transformation tasks beside Java and XSLT, such as XQuery or XUpdate, according to the specific requirements of a given transformation task, without adding additional complexity. Because the only requirement on a processor is to consume and/or produce XML infoset, the overall design and connection of components is extensible by design. Maintenance of the OT should also become much easier, because the work breaks down into several smaller, modular chunks of transformation, each dedicated to one requirement.
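
To sketch what such a processor contract could look like (everything below is invented, with SAX events standing in for the XML infoset):

    // Invented for illustration: a processor contract where SAX events stand in
    // for the XML infoset. Only the first processor in a pipe touches storage;
    // every later stage just consumes and produces events.
    import org.xml.sax.ContentHandler;

    /** One stage in a pipe: given a downstream handler, return the handler to feed. */
    interface InfosetProcessor {
        ContentHandler wrap(ContentHandler downstream);
    }

    /** A pipe is just processors composed back to front. */
    class Pipeline {
        static ContentHandler build(ContentHandler sink, InfosetProcessor... stages) {
            ContentHandler next = sink;
            for (int i = stages.length - 1; i >= 0; i--) {
                next = stages[i].wrap(next);
            }
            return next; // hand this to whatever resource processor reads the source
        }
    }

A dedicated resource processor (file system, WebDAV, JCR, whatever) parses the source and feeds events into the head of the pipe; swapping the storage backend never touches the downstream stages.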

By the way, the approach should be backed by an implementation available as open source. The best implementation of an XML pipe I'm aware of is provided by Orbeon: their framework is based on XPL, a pipeline definition language. The XPL processing engine can be used either client side or server side, integrated into their presentation server, which provides server-side processing of the transformation pipe.

The W3C is currently working on an XML processing language that is quite similar to the process model behind XPL (Alex and Eric are strongly involved in driving the standardization of XProc). Hopefully this standardization process will be successful; right now it looks very promising.

Changing the processing approach would require some redesign, but it would be worth at least discussing the pros and cons for long-term development.

alex

