Revision of Design for fix topicmerge from Tue, 2006-06-20 10:05

Requirement/Problem:
In previous releases, topic merge is written in XSL, which calls document() to load the different XML files into memory. The XSL transformation engine does not release a document tree once it has been loaded, so merging large documents causes an out-of-memory exception. Another bug in the current topic merge is that the intermediate merged file loses the structure information of the DITA map, so the structured TOC cannot be reflected in the output.
Design Description:
Use Java SAX to implement topic merge. When the processor meets a topicref in the ditamap, it reads the referenced DITA topic and changes all of the topic's references to DITA topic files into internal references. It then writes the topic content, together with all of the topicref information, into the intermediate file.
For example,
a.ditamap looks like
<map>
  <topicref href="a.dita">
    <topicref href="b.dita"/>
  </topicref>
</map>

a.dita looks like
<topic id="a">
  <title>a.dita</title>
  <body>
    <p><xref href="b.dita#b/test"/></p>
  </body>
</topic>

b.dita looks like
<topic id="b">
  <title>b.dita</title>
  <body>
    <p id="test">test paragraph</p>
  </body>
</topic>

The merged output should look like this:
<map>
  <topicref href="a.dita">
    <topic id="unique_1">
      <title>a.dita</title>
      <body>
        <p><xref href="unique_3"/></p>
      </body>
    </topic>
    <topicref href="b.dita">
      <topic id="unique_2">
        <title>b.dita</title>
        <body>
          <p id="unique_3">test paragraph</p>
        </body>
      </topic>
    </topicref>
  </topicref>
</map>
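
The rewriting step described above could be sketched with a SAX handler along the following lines. This is only an illustration under stated assumptions, not the actual implementation: the class name TopicRewriter, the key format for topic ids (file#topicid) and the way unique ids are numbered are all hypothetical. Because ids are handed out in the order they are first seen, the numbering can differ from the example above. External links and nested topics are not handled.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: streams one topic and rewrites ids and xref hrefs.
class TopicRewriter extends DefaultHandler {
    private final String file;              // file being merged, e.g. "a.dita"
    private final Map<String, String> ids;  // original address -> unique id, shared by all topics
    private final StringBuilder out = new StringBuilder();
    private String topicId = "";            // id of the enclosing topic

    TopicRewriter(String file, Map<String, String> ids) {
        this.file = file;
        this.ids = ids;
    }

    // Parses the topic text and returns the rewritten XML.
    static String rewrite(String file, String xml, Map<String, String> ids) {
        try {
            TopicRewriter handler = new TopicRewriter(file, ids);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("cannot rewrite " + file, e);
        }
    }

    // Looks up the unique id for an address, assigning a new one on first use.
    private String unique(String address) {
        String id = ids.get(address);
        if (id == null) {
            id = "unique_" + (ids.size() + 1);
            ids.put(address, id);
        }
        return id;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getQName(i);
            String value = atts.getValue(i);
            if (name.equals("id") && qName.equals("topic")) {
                topicId = value;
                value = unique(file + "#" + value);                  // assumed key: a.dita#a
            } else if (name.equals("id")) {
                value = unique(file + "#" + topicId + "/" + value);  // e.g. b.dita#b/test
            } else if (name.equals("href") && qName.equals("xref")) {
                // A same-file reference such as "#b/test" is resolved against this file.
                value = unique(value.startsWith("#") ? file + value : value);
            }
            out.append(' ').append(name).append("=\"").append(escape(value)).append('"');
        }
        out.append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length)));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        out.append("</").append(qName).append('>');
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Map<String, String> ids = new HashMap<>();
        System.out.println(rewrite("a.dita",
                "<topic id=\"a\"><body><p><xref href=\"b.dita#b/test\"/></p></body></topic>", ids));
        System.out.println(rewrite("b.dita",
                "<topic id=\"b\"><body><p id=\"test\">test paragraph</p></body></topic>", ids));
    }
}
```

Assigning the unique id on first lookup means an xref met before its target (as in a.dita) still receives the same id that the target element gets later.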

The merged output above is first stored in memory as a string. If no additional XSL is to be applied to this result, it is written directly to the merged file. Otherwise the user-defined XSL is applied to the result, and the output of that transformation is written to the merged file.
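
The optional XSL step could look roughly like the sketch below, which uses the standard javax.xml.transform API. The class and method names (MergedOutput, finish) are hypothetical, and the stylesheet is passed as a string only to keep the example self-contained. The caller would write the returned string to the merged file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch of the final step: apply a user-defined XSL only if one is given.
class MergedOutput {
    // merged:  the merged result held in memory as a string
    // userXsl: stylesheet text, or null when no additional XSL is configured
    static String finish(String merged, String userXsl) {
        if (userXsl == null) {
            return merged;  // no extra XSL: the merged string is the final output
        }
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(userXsl)));
            StringWriter result = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(merged)),
                    new StreamResult(result));
            return result.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("user-defined XSL failed", e);
        }
    }

    public static void main(String[] args) {
        String merged = "<map><topicref href=\"a.dita\"><topic id=\"unique_1\"/></topicref></map>";
        System.out.println(finish(merged, null));
    }
}
```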

Duplicate resolution:
XSL-FO doesn't allow duplicate ids. When a DITA file is referenced by topicref more than once in the ditamap, we need to mark the duplicate copies of the content in the merged output and assign different ids to them.
During processing, we use a hashtable to record the mapping between each original id and the unique id assigned to it in the merged result. In the example above, the id of the paragraph in b.dita is b.dita#b/test, and the unique id assigned to it in the merged result is unique_3. So the hashtable contains an entry whose key is b.dita#b/test and whose value is unique_3.
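
A minimal sketch of such a mapping table follows. The class name IdTable and the a.dita#a / b.dita#b keys for the topic ids are assumptions made for illustration. Only the b.dita#b/test entry is taken from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the table that maps original ids to unique ids.
class IdTable {
    private final Map<String, String> ids = new LinkedHashMap<>();
    private int counter = 0;

    // Returns the unique id for an original address, assigning the next one on first use.
    String uniqueIdFor(String address) {
        return ids.computeIfAbsent(address, key -> "unique_" + (++counter));
    }

    public static void main(String[] args) {
        IdTable table = new IdTable();
        table.uniqueIdFor("a.dita#a");                            // unique_1 (assumed key format)
        table.uniqueIdFor("b.dita#b");                            // unique_2 (assumed key format)
        System.out.println(table.uniqueIdFor("b.dita#b/test"));   // prints unique_3
        System.out.println(table.uniqueIdFor("b.dita#b/test"));   // prints unique_3 again
    }
}
```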
If we modify the ditamap to include b.dita twice, as follows:

<map>
  <topicref href="a.dita">
    <topicref href="b.dita"/>
  </topicref>
  <topicref href="b.dita"/>
</map>
then the second b.dita will be recorded as b.dita(d<random_num>). The hashtable will then contain the keys b.dita(d<random_num>) and b.dita(d<random_num>)#b/test. <random_num> is generated randomly, so two duplicate copies are unlikely to receive the same <random_num> unless the author includes the same file in the ditamap a very large number of times. A possible merged result looks like:
<map>
  <topicref href="a.dita">
    <topic id="unique_1">
      <title>a.dita</title>
      <body>
        <p><xref href="unique_3"/></p>
      </body>
    </topic>
    <topicref href="b.dita">
      <topic id="unique_2">
        <title>b.dita</title>
        <body>
          <p id="unique_3">test paragraph</p>
        </body>
      </topic>
    </topicref>
  </topicref>
  <topicref href="b.dita(d123523)">
    <topic id="unique_4">
      <title>b.dita</title>
      <body>
        <p id="unique_5">test paragraph</p>
      </body>
    </topic>
  </topicref>
</map>

If b.dita contains an <xref> that refers to the paragraph in the same file, the merged result could be:

<map>
  <topicref href="a.dita">
    <topic id="unique_1">
      <title>a.dita</title>
      <body>
        <p><xref href="unique_3"/></p>
      </body>
    </topic>
    <topicref href="b.dita">
      <topic id="unique_2">
        <title>b.dita</title>
        <body>
          <p id="unique_3">test paragraph</p>
          <xref href="unique_3"/>
        </body>
      </topic>
    </topicref>
  </topicref>
  <topicref href="b.dita(d123523)">
    <topic id="unique_4">
      <title>b.dita</title>
      <body>
        <p id="unique_5">test paragraph</p>
        <xref href="unique_5"/>
      </body>
    </topic>
  </topicref>
</map>
The duplicate <xref> should point to the duplicate <p> because they are in the same file.
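
One way to get this behaviour is to give each occurrence of a file its own key and to rebase same-file references onto that key before looking them up in the hashtable. The sketch below is only an illustration. OccurrenceKeys, keyFor and rebase are hypothetical names, and the retry on a colliding random number is an addition that goes beyond the description above.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: one key per occurrence of a file in the ditamap.
class OccurrenceKeys {
    private final Set<String> used = new HashSet<>();
    private final Random random;

    OccurrenceKeys(Random random) {
        this.random = random;
    }

    // First occurrence keeps the plain file name; later ones get file(d<random_num>).
    String keyFor(String file) {
        if (used.add(file)) {
            return file;
        }
        String key;
        do {
            key = file + "(d" + random.nextInt(1_000_000) + ")";
        } while (!used.add(key));  // retry if this random number was already taken
        return key;
    }

    // Rebases an href that targets the file being processed onto the current occurrence,
    // so that a duplicate xref resolves to the duplicate target. Other hrefs are unchanged.
    static String rebase(String href, String file, String occurrenceKey) {
        if (href.startsWith("#")) {
            return occurrenceKey + href;
        }
        if (href.startsWith(file + "#")) {
            return occurrenceKey + href.substring(file.length());
        }
        return href;
    }

    public static void main(String[] args) {
        OccurrenceKeys keys = new OccurrenceKeys(new Random());
        String first = keys.keyFor("b.dita");   // "b.dita"
        String second = keys.keyFor("b.dita");  // "b.dita(d<random_num>)"
        System.out.println(rebase("b.dita#b/test", "b.dita", first));
        System.out.println(rebase("b.dita#b/test", "b.dita", second));
    }
}
```

The rebased address, for example b.dita(d<random_num>)#b/test, is then the key used in the hashtable, so the two copies of the paragraph map to different unique ids.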
