Information set proposal

I'm coming around to the idea that we can't handle all of the dependency tracking with a specialized map. But, I'm wondering if we can come up with an approach that still leverages maps for the dependencies they express and supports the continuum of approaches.

Proposal: We introduce a new kind of XML document that identifies the source files belonging to an information set. The basic structure is very simple:

<infoset>
    <data ...> ... </data>
    ... more data about this infoset ...

    <source href="filename" modified="2006-5-2 20:05:01">
        <data ...> ... </data>
        ... more data about this source ...
    </source>

    ... more sources ...

</infoset>

The <data> element would be the same as in DITA 1.1, if only for cognitive reuse.

So far, this representation could be used in the same way as the existing build list; it is just more processable because it is XML, so the artifact could be retained for later processing.

For the next level of use, the <source> element can not only refer to the source file but also store a snapshot of it, or some subset of it. We can also specialize the elements for clarity and extensibility, and to allow validation (if desired).

By pulling in the map and a subset of the topic, we have a complete declaration of the dependencies within the information set:

<infoset>
    <mapSource href="mapName.ditamap" modified="...">
        <map>
            <topicref href="parentTopic.dita">
                <topicref href="childTopic.dita"/>
            </topicref>
            ...
        </map>
    </mapSource>
    ... more maps ...

    <topicSource href="conrefSource.dita" modified="...">
        <topic id="conrefSource">
            <title>Conref source</title>
            <body>
                <keyword id="productName">Product name</keyword>
            </body>
        </topic>
    </topicSource>
    <topicSource href="xrefSource.dita" modified="...">
        <topic id="xrefSource">
            <title>Xref source</title>
            <body>
                <section id="sectionID">
                    <title>About <keyword conref="conrefSource.dita#conrefSource/productName"/></title>
                </section>
            </body>
        </topic>
    </topicSource>
    <topicSource href="dependentTopic.dita" modified="...">
        <topic id="dependentTopic">
            <title>Dependent topic</title>
            <body>
                <xref href="xrefSource.dita#xrefSource/sectionID"/>
                <image href="images/picture.gif"/>
            </body>
        </topic>
    </topicSource>
    ... more topics ...
</infoset>

Note that external objects such as images are identified.

Capturing this information in the <infoset> gives us everything we need for incremental builds. When a file changes, we only have to update the <infoset> and regenerate output for the changed file and any files that depend on it.
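To make that concrete, here is a minimal sketch of the idea in Python. It parses a small infoset like the example above (element names follow the example; the traversal and rebuild logic are assumptions, not existing DITA-OT code), builds a reverse-dependency map from @href/@conref, and computes the transitive set of files needing a rebuild when one source changes.

```python
# Sketch: dependency tracking over a populated infoset.
# Element names (topicSource, conref, href) follow the example above;
# the traversal and closure logic are illustrative assumptions.
import xml.etree.ElementTree as ET
from collections import defaultdict

INFOSET = """\
<infoset>
  <topicSource href="conrefSource.dita" modified="...">
    <topic id="conrefSource"/>
  </topicSource>
  <topicSource href="xrefSource.dita" modified="...">
    <topic id="xrefSource">
      <keyword conref="conrefSource.dita#conrefSource/productName"/>
    </topic>
  </topicSource>
  <topicSource href="dependentTopic.dita" modified="...">
    <topic id="dependentTopic">
      <xref href="xrefSource.dita#xrefSource/sectionID"/>
      <image href="images/picture.gif"/>
    </topic>
  </topicSource>
</infoset>
"""

def dependents_of(infoset_xml):
    """Map each referenced file to the set of sources that depend on it."""
    rdeps = defaultdict(set)
    for source in ET.fromstring(infoset_xml):
        owner = source.get("href")
        for el in source.iter():
            if el is source:
                continue  # skip the <topicSource> wrapper's own @href
            target = el.get("conref") or el.get("href")
            if target:
                rdeps[target.split("#")[0]].add(owner)  # drop fragment
    return rdeps

def needs_rebuild(changed, rdeps):
    """Transitive closure: everything that depends on the changed file."""
    dirty, stack = set(), [changed]
    while stack:
        for dep in rdeps.get(stack.pop(), ()):
            if dep not in dirty:
                dirty.add(dep)
                stack.append(dep)
    return dirty

rdeps = dependents_of(INFOSET)
print(sorted(needs_rebuild("conrefSource.dita", rdeps)))
# → ['dependentTopic.dita', 'xrefSource.dita']
```

Changing conrefSource.dita dirties xrefSource.dita (which conrefs it) and, transitively, dependentTopic.dita (which xrefs xrefSource.dita) — exactly the propagation the infoset is meant to support.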

The process that updates the infoset would probably normalize the class attributes so downstream processing doesn't have to validate.

Variation 1: We could pull in the entire topic instead of just the elements that are the source or target of dependencies.

For someone who doesn't mind keys in their XSLT, that could meet the same requirements as a merged map but without duplicate topics.

Effectively, this populated infoset document provides the poor man's portable CMS.

You could run an XML Diff on the old and new infoset to capture the deltas for efficient propagation of updates to another site.
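A full XML diff is one option; even a cheap comparison of @href/@modified between the old and new infoset identifies the changed sources. A sketch (assuming the @modified attribute from the examples above is maintained reliably):

```python
# Sketch: find changed or added sources by comparing @href/@modified
# between two infoset snapshots. A poor man's XML diff.
import xml.etree.ElementTree as ET

def changed_sources(old_xml, new_xml):
    stamp = lambda xml: {s.get("href"): s.get("modified")
                         for s in ET.fromstring(xml)}
    old, new = stamp(old_xml), stamp(new_xml)
    return {href for href in new if old.get(href) != new[href]}

old = '<infoset><topicSource href="a.dita" modified="1"/></infoset>'
new = ('<infoset><topicSource href="a.dita" modified="2"/>'
       '<topicSource href="b.dita" modified="1"/></infoset>')
print(changed_sources(old, new))  # → {'a.dita', 'b.dita'}
```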

If you really had to, you could even pull in encoded external binaries and pipe the entire infoset as a single stream.

Variation 2: You could have <source> elements that refer to Ant build targets or even contain snapshots of the Ant build files. Interesting things then become much easier: for instance, running an Ant build against the infoset, or using XSLT to generate a subordinate Ant file from an Ant template held in the infoset and then executing it as a continuation of the Ant build.

Footnote: Probably, we should call them something other than infosets to avoid confusion with the XML infoset.

I need to come at this from a different direction...

What affects how a build-result looks?
* The content of the map(s) in it.
* The content of the topic(s) that the map(s) refer(s) to.
* The content of the @scope="local" resources that the map/topic(s) refer(s) to.
* The values of parameters supplied to the build process (transtype, args.*, output.dir, ...)

I am particularly interested in that last one.  If I build foo.ditamap->PDF then I cannot assume that I have done anything at all to further a foo.ditamap->XHTML transformation.  Even something innocuous like conref resolution might depend on the transformation type.  The same is true of every build parameter: if I change the output directory then I might have to rebuild everything Just In Case.  This is not sounding encouraging.
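One way to make "parameters are inputs too" precise: treat the full parameter set as part of the build signature, so two builds that differ in any parameter never share cached work. A sketch (parameter names are illustrative):

```python
# Sketch: a build signature that folds in the parameter set, since any
# parameter change may invalidate prior work. Names are illustrative.
import hashlib
import json

def build_signature(input_files: dict, params: dict) -> str:
    """input_files maps path -> content hash; params is the full
    parameter set (transtype, output.dir, args.*, ...)."""
    payload = json.dumps({"inputs": input_files, "params": params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

pdf = build_signature({"foo.ditamap": "abc123"}, {"transtype": "pdf"})
xhtml = build_signature({"foo.ditamap": "abc123"}, {"transtype": "xhtml"})
assert pdf != xhtml  # the PDF build tells us nothing about the XHTML build
```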

This is made so much harder by the pipeline-between-files nature of DITA-OT.  The intermediate forms (e.g., after topicpull but before conref) are not stored.  That means that conref now has topicpull as a dependency too, and all of topicpull's dependencies on top of it.  Look at it that way and it might be that the final output depends on every piece of input.  You might not be saving anything.

I'm envisaging some alternative to DITA's pipelining where the pipeline stages are stateless and idempotent.  Run the same transformation, on the same input parameters, and you'll get the same output.  _Even on the smallest piece of the pipeline._  This is not currently guaranteed.  I'm starting to think that it has to be if we are to get anywhere decent.

I think that to get this, every pipeline stage needs to be self-aware, and be able to declare which inputs it uses, and which output it produces.  If this is the intention of <infoset>, then good, we are on the same wavelength.  Looking at it this way, I don't see an argument for <infoset> to cache "frequently used" values - done right, these values won't be frequently used because they will be accessed only when they change, and caching is very hard to do right.  I vote for getting the design right, burning as many bridges as necessary, and *then* worrying about whether we should/can cache values.
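A self-aware stage might look like the following sketch: each stage declares what it reads and what it writes, so a driver can schedule and invalidate work without inspecting the stage body. All names here are hypothetical.

```python
# Sketch: a self-describing pipeline stage with declared inputs and
# output, letting a driver answer "what is affected if X changes?"
# without running anything. All names are illustrative.
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class Stage:
    name: str
    reads: Tuple[str, ...]   # declared inputs (artifacts, parameters)
    writes: str              # declared output artifact
    run: Callable[..., str]

conref = Stage("conref", reads=("topicpull.out", "transtype"),
               writes="conref.out", run=lambda *inputs: "...")

def affected_stages(stages, changed: str):
    """Stages whose declared inputs include the changed artifact."""
    return [s.name for s in stages if changed in s.reads]

assert affected_stages([conref], "transtype") == ["conref"]
```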