Diff for Design for chunking enhancements

--- Mon, 2009-01-12 18:56 by robander
+++ Mon, 2009-01-12 18:58 by robander
 <p>
 &nbsp;
+</p>
+<p>
+******* NOTE *******
+</p>
+<p>
+This design is copied from a design created by Stephen on the DITA-OT development team. The wiki will not let him update the page, and we are having several difficulties with the formatting below, particularly with the diagram and following four steps in section III.
+</p>
+<p>
+**********************
 </p>
 <h3>
 </p>
 <p>
-         if &quot;to-content&quot; is specified in map chunk<br />
+if &quot;to-content&quot; is specified in map chunk<br />
 Use map name as new chunk file name<br />
 if an identical file name already exists<br />
 </p>
 <p>
-          if &quot;by-topic&quot; is specified in chunk attribute<br />
+if &quot;by-topic&quot; is specified in chunk attribute<br />
 Use current topic id as new chunk file name<br />
 if an identical file name already exists<br />
 </p>
 <p>
-           for each entry in changeTable<br />
+for each entry in changeTable<br />
 do<br />
 if randomURL != null &amp;&amp; originalURL.exists()<br />
 </p>
 <p>
-             for each topic file in the updated list of topics
+for each topic file in the updated list of topics
 </p>
 <p>
-             do
+do
 </p>
 parse the topic<br />

Current revision:

Design for chunking enhancements

******* NOTE *******

This design is copied from a design created by Stephen on the DITA-OT development team. The wiki will not let him update the page, and we are having several difficulties with the formatting below, particularly with the diagram and following four steps in section III.

**********************

I. Prologue.

Before discussing the detailed implementation design, let's look at two examples. We have four files: map.ditamap, map.dita, A.dita, and B.dita. B.dita has cross references to map.dita and A.dita. A.dita has a reference to map.dita and a reference to B.dita. Both A.dita and B.dita contain multiple topics; in other words, both are ditabase files.

*) Case 1: map.ditamap

It should produce an index.html, a map.html containing all content from both map.dita and A.dita, and a B.html. References to map.dita in A.dita and B.dita should be updated to point to map.html, with the proper topic ID appended. References to A.dita in B.dita should be updated to point to map.html, with the proper ID.

*) Case 2: map.ditamap:

It should produce an index.html, a map.html (containing map.dita, B.dita#T1), a B.html and an A.html. All references to B.dita#T1 in A.dita should be updated to map.html#T1 (or equivalent ID). All references to A.dita in B.dita should remain unchanged.

Case 1 shows that:
1) Name conflicts may arise during the chunk process. (map.dita <--> map.ditamap)
2) New content may be added to an existing file. (map.html actually contains all content from map.dita and A.dita)
3) References in other topics may need to be updated to the correct URL after chunking. (In B.dita, the reference to A.dita should be changed to map.html)

Case 2 shows that:
1) A dita file may be partially "chunked" (moved into other files). (B.dita#T1 --> map.html)
2) Individual topics must be taken into account when deciding whether to exclude a file from the final results. (Although B.dita#T1 is processed as a chunk, B.dita still needs to generate output because B.dita#T2 is also referenced in the same file)

Thus, the examples above reveal three important issues:
1) References in topics may need to be altered after the chunk process so that they point to the correct URL.
2) Topics in dita files must also be taken into account when excluding extra files.
3) Files may be partially "chunked" or preserved.

II. Design on avoiding generating extra files after chunk process.

Based on these issues, a new list file "canditopics.list" is needed for recording all file URLs that appear as an "href" attribute on any element. A sample fragment of this file may be:

foo.dita#foobar
bar.dita#topicID
foobar.dita

This list is used for denoting all topics that will probably generate output in the final results. Of all these candidate topics, some may be excluded from the final result because they are to be processed in the later chunk step. To strip off these topics, another list "skipchunk.list" is also needed. "skipchunk.list" will denote all "href" attributes that appear together with a "chunk" attribute. Its contents are similar to "canditopics.list", and a sample fragment would be like:

foo.dita#bar
bar.dita#foo
foobar.dita
...

When the normal chunk process ends, the entries in "skipchunk.list" are stripped out of "canditopics.list". All remaining topics should generate output. In other words, entries that are not in "canditopics.list" are stripped out of dita.list, and the other lists are updated accordingly.

To achieve that, we need to modify GenListModuleReader to generate "canditopics.list" and "skipchunk.list" accordingly.

In GenListModuleReader:
1. Add a new HashSet hrefTopicSet for storing all href attributes in any element in a ditamap file.
2. Add a new HashSet chunkTopicSet for storing all href attributes that appear together with chunk attributes in a ditamap file.
3. Add a method for getting hrefTopicSet.
4. Add a method for getting chunkTopicSet.
5. When the file being parsed is a ditamap file, add logic in method "startElement" to store the "href" attribute in hrefTopicSet: if the "href" value is not empty, the scope value is not "external" or "peer", and the target file is valid, add it to hrefTopicSet.
6. Add logic for adding elements to chunkTopicSet: if an "href" value appears together with a chunk attribute, the scope value is not "external" or "peer", and the target file is valid, add it to chunkTopicSet.
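
The filtering rules in steps 5 and 6 can be sketched as follows. This is only an illustration, not the actual GenListModuleReader code: the class name MapHrefCollector, the method record, and the targetIsValid flag are assumptions made for the example, and the validity check itself is left to the caller.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper that mirrors steps 5 and 6; not part of DITA-OT.
final class MapHrefCollector {
    final Set<String> hrefTopicSet = new LinkedHashSet<>();
    final Set<String> chunkTopicSet = new LinkedHashSet<>();

    // Called once per element of a ditamap; a null argument means the
    // attribute is absent on that element.
    void record(String href, String scope, String chunk, boolean targetIsValid) {
        if (href == null || href.isEmpty()) {
            return; // nothing to record
        }
        if ("external".equals(scope) || "peer".equals(scope)) {
            return; // not part of this build
        }
        if (!targetIsValid) {
            return; // assumed to be decided by the caller
        }
        hrefTopicSet.add(href);      // candidate for canditopics.list
        if (chunk != null) {
            chunkTopicSet.add(href); // candidate for skipchunk.list
        }
    }
}
```

The anchor fragment is kept in both sets, which is the point of introducing the two new lists.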

In GenMapAndTopicListModule:
1. Add a new HashSet hrefTopicSet for storing all href attributes in any element in a ditamap file.
2. Add a new HashSet chunkTopicSet for storing all href attributes that appear together with chunk attributes in a ditamap file.
3. Add logic for retrieving "href topics" from parsed ditamap files into hrefTopicSet in method "processParseResult(String)".
4. Add logic for retrieving "chunk topics" from parsed ditamap files into chunkTopicSet in method "processParseResult(String)".
5. Add logic for preparing hrefTopicSet for writing out in method "outputResult".
6. Add logic for preparing chunkTopicSet for writing out in method "outputResult".

In Constants:
1. Add a new string "canditopics".
2. Add a new string "skipchunk".

Note: We already have logic that generates href targets, but it only produces file names (such as foo.xml) and discards the anchor fragment, which is exactly the part we need here. Thus we choose to generate new files rather than change the existing one, in order not to affect other parts of the OT.

We also need to modify the logic in the ChunkModule::updateList() method. Right before updating dita.list, the original topic list must be refined with "canditopics.list" and "skipchunk.list":
1. Load canditopics.list into hrefTopicSet.
2. Load skipchunk.list into chunkTopicSet.
3. Remove from hrefTopicSet all entries that exist in both sets.
4. For each remaining entry in hrefTopicSet, cut off its anchor fragment. (Now these files are ready to generate output).
5. Write hrefTopicSet to dita.list as "FULL_DITA_TOPIC_LIST" and "FULL_DITAMAP_LIST", replacing the old values.
6. Update the list with "changeTable", which may contain newly generated chunk files.
7. Remove any unwanted topic files from the temp dir (those that are not in the updated list).

NOTE: "changeTable" will be explained later.
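
Steps 3 and 4 above amount to a set difference followed by stripping the fragment. A minimal sketch, assuming both lists are already loaded into sets of strings; the class name TopicListRefiner and the method filesToOutput are hypothetical:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper illustrating steps 3 and 4; not part of DITA-OT.
final class TopicListRefiner {
    static Set<String> filesToOutput(Set<String> hrefTopicSet, Set<String> chunkTopicSet) {
        // Step 3: drop every entry that is handled by the chunk step.
        Set<String> remaining = new LinkedHashSet<>(hrefTopicSet);
        remaining.removeAll(chunkTopicSet);

        // Step 4: cut off the anchor fragment, leaving plain file names.
        Set<String> files = new LinkedHashSet<>();
        for (String href : remaining) {
            int hash = href.indexOf('#');
            files.add(hash < 0 ? href : href.substring(0, hash));
        }
        return files;
    }
}
```

With hrefTopicSet holding B.dita#T1, B.dita#T2 and A.dita, and chunkTopicSet holding only B.dita#T1, the result contains B.dita and A.dita. B.dita survives because B.dita#T2 is still referenced, which is the situation described in Case 2.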

III. File name issues in chunk process.

According to the chunk specification, when creating new documents via chunk processing, the storage object name or identifier (if relevant) is taken from the copyto attribute if set; otherwise the root name is taken from the id attribute if the by-topic policy is in effect, and from the name of the referenced document if the by-document policy is in effect. Unfortunately, it does not say what to do if a name conflict appears. Therefore we shall follow these naming steps:
1. Try to use a specified name (such as the copyto value), if any.
2. If chunk="to-content" is specified on the map element, then the chunk result should first try to use the ditamap's file name.
3. If "by-topic" is specified in the chunk attribute, then the chunk result should first try to use the topic id as the file name.
4. Otherwise use the name of the referenced file that is to be chunked.
5. If there is a name conflict:
a) Mark this conflict.
b) Generate a random file name and associate it with the original file name.
c) Use the random name to complete the chunk process.
d) After the unwanted extra files have been removed, check each name conflict: if the original file name still exists, the conflict cannot be resolved, so the random name is used as the new file name. Otherwise, rename the randomly named file to its original name.
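
The naming precedence in steps 1 to 4 and the fallback in step 5 can be sketched like this. Everything here is an assumption made for illustration: the class ChunkNaming, its parameters, and the "Chunk" prefix and ".dita" suffix of the random name are hypothetical, and the caller is assumed to pass complete candidate file names.

```java
import java.util.Set;
import java.util.UUID;

// Hypothetical helper illustrating the naming steps; not part of DITA-OT.
final class ChunkNaming {
    // Steps 1 to 4: choose the preferred file name in order of precedence.
    static String preferredName(String copyTo, boolean mapToContent, String mapName,
                                boolean byTopic, String topicIdName, String referencedName) {
        if (copyTo != null && !copyTo.isEmpty()) {
            return copyTo;         // step 1: explicitly specified name
        }
        if (mapToContent) {
            return mapName;        // step 2: name derived from the ditamap
        }
        if (byTopic) {
            return topicIdName;    // step 3: name derived from the topic id
        }
        return referencedName;     // step 4: name of the referenced file
    }

    // Step 5 a) to c): use a random name when the preferred one is taken.
    // The naming pattern is an assumption for this sketch.
    static String nameToUse(String preferred, Set<String> existingNames) {
        if (!existingNames.contains(preferred)) {
            return preferred;
        }
        return "Chunk" + UUID.randomUUID().toString().replace("-", "") + ".dita";
    }
}
```

The caller would record the pair of preferred and random names so that step 5 d) can be applied once the extra files have been removed.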

In order to track name conflicts, the original "changeTable" structure needs to be modified to meet our needs.

+--------------+
| Change Table |
+--------------+
       |
       |    Key                                   Value
       |    +----------------------------+        +--------------------------+
       +----| Original File URL (String) |------->| Target File URL (String) |
            +----------------------------+        +--------------------------+
                                                  | Random File URL (String) |
                                                  +--------------------------+

A name conflict can be denoted as an association between the original URL and the target URL. The original URL is the original file URL as it appears as a reference in an "href" attribute. If a topic is processed into a separate chunk file, all references to the topic should be updated to point to the new target URL in the chunk file. If the target file already exists, a random file name is used and associated with the original one, because we do not want to lose the information needed to resolve the conflict later.
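
One possible shape for the modified table is sketched below. It is an assumption, not the existing DITA-OT structure: the class ChangeTable and its methods put and hasConflict are hypothetical, and only the name getTargetValue is borrowed from the pseudocode in this design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the change table: original URL -> (target, random).
final class ChangeTable {
    static final class Targets {
        String targetUrl; // where references should point after chunking
        String randomUrl; // temporary name, or null if there was no conflict

        Targets(String targetUrl, String randomUrl) {
            this.targetUrl = targetUrl;
            this.randomUrl = randomUrl;
        }
    }

    private final Map<String, Targets> entries = new LinkedHashMap<>();

    void put(String originalUrl, String targetUrl, String randomUrl) {
        entries.put(originalUrl, new Targets(targetUrl, randomUrl));
    }

    // A conflict is marked by the presence of a random URL.
    boolean hasConflict(String originalUrl) {
        Targets t = entries.get(originalUrl);
        return t != null && t.randomUrl != null;
    }

    // Returns null when the original URL is not a key in the table.
    String getTargetValue(String originalUrl) {
        Targets t = entries.get(originalUrl);
        return t == null ? null : t.targetUrl;
    }
}
```

Keeping the random URL next to the target URL, instead of overwriting the target, preserves the information needed to undo the renaming later.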

In addition, we need to update all references in all topics according to "changeTable" because topics may be moved between files and references should follow the chunk results.

To achieve that we need to:
1. Modify the logic in method "read" of ChunkMapReader:

if "to-content" is specified in map chunk
    Use map name as new chunk file name
    if an identical file name already exists
        Use a randomly generated name as chunk file name
        Associate target name and random name with original name
    end if
end if
Do normal chunk process
...

2. Modify the logic in method "startElement" of ChunkTopicParser:

if "by-topic" is specified in chunk attribute
    Use current topic id as new chunk file name
    if an identical file name already exists
        Use a randomly generated name as chunk file name
        Associate target name and random name with original name
    end if
end if
Do normal chunk process
...

3. Add a method "resolveNamingConflicts" in ChunkModule, which will be called right after "updateList" to resolve name conflicts. The method completes the following tasks:

for each entry in changeTable
do
    if randomURL != null && originalURL.exists()
        targetURL = randomURL
    else if randomURL != null && !originalURL.exists()
        rename randomURL to targetURL
    end if
    randomURL = null
done
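
A rough Java rendering of this conflict resolution is given below, following step 5 d) of the naming rules. It is a sketch under assumptions: the class ConflictResolver and its Entry type are hypothetical, the table is simplified to a list of entries, and the existence test is made on the original URL as in the pseudocode.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of "resolveNamingConflicts"; not the DITA-OT code.
final class ConflictResolver {
    static final class Entry {
        final Path original;
        Path target;
        Path random; // null when no conflict was marked

        Entry(Path original, Path target, Path random) {
            this.original = original;
            this.target = target;
            this.random = random;
        }
    }

    static void resolve(List<Entry> changeTable) throws IOException {
        for (Entry e : changeTable) {
            if (e.random == null) {
                continue; // no conflict was marked for this entry
            }
            if (Files.exists(e.original)) {
                // The conflict cannot be resolved: keep the random name.
                e.target = e.random;
            } else {
                // The conflicting file is gone: restore the intended name.
                // Files.move throws FileAlreadyExistsException if the
                // target is unexpectedly still present.
                Files.move(e.random, e.target);
            }
            e.random = null; // the conflict has been handled
        }
    }
}
```

After the call, every entry has a usable target and no pending random name, so the reference update that follows only has to look at the target.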

4. Add a method "updateTopicRefs" in ChunkModule, which will be called right after "resolveNamingConflicts" to update topic references. It completes the following tasks:

for each topic file in the updated list of topics
do
    parse the topic
    if current element contains "href" attribute
        if "href" value is a key in "changeTable"
            update "href" value with changeTable.getTargetValue("href" value)
        end if
    end if
done
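
The per-topic part of this loop could look like the sketch below. It uses a DOM tree for brevity, whereas the design talks about parsing (presumably with the existing SAX-style readers), so treat the approach as a substitute chosen for illustration. The class TopicRefUpdater is hypothetical, and the change table is reduced to a plain map from original "href" value to target value.

```java
import java.util.Map;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of the inner part of "updateTopicRefs".
final class TopicRefUpdater {
    // Rewrites every "href" that is a key in the table; returns the
    // number of attributes that were changed.
    static int updateHrefs(Document topic, Map<String, String> targets) {
        NodeList all = topic.getElementsByTagName("*"); // every element
        int changed = 0;
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            if (!el.hasAttribute("href")) {
                continue;
            }
            String target = targets.get(el.getAttribute("href"));
            if (target != null) {
                el.setAttribute("href", target);
                changed++;
            }
        }
        return changed;
    }
}
```

References whose "href" value is not a key in the table are left untouched, which matches Case 2, where references to A.dita in B.dita remain unchanged.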

NOTE: Parsing all topic files again looks expensive, but I think it is unavoidable. The chunk process needs all files to be present throughout. Some of them may disappear or be renamed after the chunk process finishes, and whether a file will remain unchanged cannot be determined until all chunk processing has finished. Therefore all references must be updated in a separate, fresh parse.

IV. Summary.

To summarize, the suggested measures for solving the chunk problems are:
1. Generate "canditopics.list" and "skipchunk.list" in GenMapAndTopicListModule.
2. Mark name conflicts in "changeTable" during the chunk process.
3. Update dita.list with "canditopics" and "skipchunk" together with "changeTable".
4. Update all references in topics according to "changeTable" at the end of the chunk process.