Diff for Design for chunking enhancements

Mon, 2009-01-12 18:38 by robander
Mon, 2009-01-12 18:51 by robander
Changes to Body
Line 6
Line 6
 
I. Prologue.</h3>
 
I. Prologue.</h3>
 
<p>
 
<p>
-
  <br />
  
-
  Before we start to discuss the detailed implementation design, let's first have a look at two examples. Here we have: map.ditamap, map.dita, A.dita, B.dita. File B.dita has cross references to map.dita and A.dita. File A.dita has a reference to map.dita and a reference to B.dita. Both A and B have multiple topics, in other words, they are both ditabase files.<br />
  
 
<br />
 
<br />
-
   *) Case 1
+
Before we start to discuss the detailed implementation design, let's first have a look at two examples. Here we have: map.ditamap, map.dita, A.dita, B.dita. File B.dita has cross references to map.dita and A.dita. File A.dita has a reference to map.dita and a reference to B.dita. Both A and B have multiple topics, in other words, they are both ditabase files.<br />
  +
<br />
  +
*) Case 1: map.ditamap
 
</p>
 
</p>
-
<pre>
+
&lt;map chunk=&quot;to-content&quot;&gt;                             <br />
-
                        map.ditamap
+
&lt;topicref chunk=&quot;select-topic&quot; href=&quot;map.dita&quot;/&gt;   <br />
-
</pre>
+
&lt;topicref chunk=&quot;select-topic&quot; href=&quot;A.dita&quot;&gt;      <br />
-
<pre>
+
&lt;topicref chunk=&quot;select-topic&quot; href=&quot;map.dita&quot;/&gt; <br />
-
   +--------------------------------------------------------+
+
&lt;/topicref&gt;                                        <br />
-
</pre>
+
&lt;/map&gt;                                               <br />
-
<pre>
+
<br />
-
   | &lt;map chunk=&quot;to-content&quot;&gt;                               |
+
<br />
-
</pre>
+
-
<pre>
+
-
   |   &lt;topicref chunk=&quot;select-topic&quot; href=&quot;map.dita&quot;/&gt;     |
+
-
</pre>
+
-
<pre>
+
-
   |   &lt;topicref chunk=&quot;select-topic&quot; href=&quot;A.dita&quot;&gt;        |
+
-
</pre>
+
-
<pre>
+
-
   |     &lt;topicref chunk=&quot;select-topic&quot; href=&quot;map.dita&quot;/&gt;   |
+
-
</pre>
+
-
<pre>
+
-
   |   &lt;/topicref&gt;                                          |
+
-
</pre>
+
-
<pre>
+
-
   | &lt;/map&gt;                                                 |
+
-
</pre>
+
-
<pre>
+
-
   |                                                        |
+
-
</pre>
+
-
<pre>
+
-
   +--------------------------------------------------------+
+
-
</pre>
+
 
<p>
 
<p>
-
   It should produce an index.html, a map.html containing all the content of both map.dita and A.dita, and a B.html. References to map.dita in A.dita and B.dita should be updated to point to map.html, with the proper topic ID appended. References to A.dita in B.dita should likewise be updated to point to map.html, with the proper ID.<br />
+
It should produce an index.html, a map.html containing all the content of both map.dita and A.dita, and a B.html. References to map.dita in A.dita and B.dita should be updated to point to map.html, with the proper topic ID appended. References to A.dita in B.dita should likewise be updated to point to map.html, with the proper ID.<br />
-
   <br />
+
<br />
-
   *) Case 2
+
*) Case 2: map.ditamap:
 
</p>
 
</p>
-
<pre>
+
&lt;map&gt;                                                        <br />
-
                       map.ditamap:
+
  &lt;topicref chunk=&quot;to-content select-topic&quot; href=&quot;map.dita&quot;&gt; <br />
-
</pre>
+
    &lt;topicref chunk=&quot;select-topic&quot; href=&quot;B.dita#T1&quot;/&gt;        <br />
-
<pre>
+
  &lt;/topicref&gt;                                                <br />
-
   +---------------------------------------------------------------+
+
  &lt;topicref href=&quot;A.dita#T2&quot;/&gt;                               <br />
-
</pre>
+
  &lt;topicref href=&quot;B.dita#T2&quot;/&gt;                               <br />
-
<pre>
+
&lt;/map&gt;                                                       <br />
-
   | &lt;map&gt;                                                         |
+
<br />
-
</pre>
+
-
<pre>
+
-
   |   &lt;topicref chunk=&quot;to-content select-topic&quot; href=&quot;map.dita&quot;&gt;  |
+
-
</pre>
+
-
<pre>
+
-
   |     &lt;topicref chunk=&quot;select-topic&quot; href=&quot;B.dita#T1&quot;/&gt;         |
+
-
</pre>
+
-
<pre>
+
-
   |   &lt;/topicref&gt;                                                 |
+
-
</pre>
+
-
<pre>
+
-
   |   &lt;topicref href=&quot;A.dita#T2&quot;/&gt;                                |
+
-
</pre>
+
-
<pre>
+
-
   |   &lt;topicref href=&quot;B.dita#T2&quot;/&gt;                                |
+
-
</pre>
+
-
<pre>
+
-
   | &lt;/map&gt;                                                        |
+
-
</pre>
+
-
<pre>
+
-
   |                                                               |
+
-
</pre>
+
-
<pre>
+
-
   +---------------------------------------------------------------+
+
-
</pre>
+
 
<p>
 
<p>
-
   It should produce an index.html, a map.html (containing map.dita, B.dita#T1), a B.html and an A.html. All references to B.dita#T1 in A.dita should be updated to map.html#T1 (or equivalent ID). All references to A.dita in B.dita should remain unchanged.<br />
+
It should produce an index.html, a map.html (containing map.dita, B.dita#T1), a B.html and an A.html. All references to B.dita#T1 in A.dita should be updated to map.html#T1 (or equivalent ID). All references to A.dita in B.dita should remain unchanged.<br />
-
   <br />
+
<br />
-
   Case 1 shows that:<br />
+
Case 1 shows that:<br />
-
     1) Name conflicts may exist during chunk process. (map.dita &lt;--&gt; map.ditamap)<br />
+
1) Name conflicts may exist during chunk process. (map.dita &lt;--&gt; map.ditamap)<br />
-
     2) New contents may be added to an existing file. (map.html actually contains all contents from map.dita and A.dita)<br />
+
2) New contents may be added to an existing file. (map.html actually contains all contents from map.dita and A.dita)<br />
-
     3) References in other topics may need to be updated to correct URL after chunk. (In B.dita, reference to A.dita should be changed to map.html)<br />
+
3) References in other topics may need to be updated to correct URL after chunk. (In B.dita, reference to A.dita should be changed to map.html)<br />
-
   <br />
+
<br />
-
   Case 2 shows that:<br />
+
Case 2 shows that:<br />
-
     1) A dita file may be partially &quot;chunked&quot; (moved into other files). (B.dita#T1 --&gt; map.html)<br />
+
1) A dita file may be partially &quot;chunked&quot; (moved into other files). (B.dita#T1 --&gt; map.html)<br />
-
     2) Concrete topics should be taken into consideration when deciding whether to exclude a file from final results. (Although, B.dita#T1 is processed as a chunk, B.dita still needs to generate output since B.dita#T2 is also referenced in the same file)<br />
+
2) Concrete topics should be taken into consideration when deciding whether to exclude a file from final results. (Although, B.dita#T1 is processed as a chunk, B.dita still needs to generate output since B.dita#T2 is also referenced in the same file)<br />
-
   <br />
+
<br />
-
   Thus, the examples above reveal 3 important issues:<br />
+
Thus, the examples above reveal 3 important issues:<br />
-
     1) References in topics may be altered after chunk process to point to correct URL.<br />
+
1) References in topics may be altered after chunk process to point to correct URL.<br />
-
     2) Topics in dita files should also be taken into consideration when excluding extra files.<br />
+
2) Topics in dita files should also be taken into consideration when excluding extra files.<br />
-
     3) Files may be partially &quot;chunked&quot; or reserved.<br />
+
3) Files may be partially &quot;chunked&quot; or reserved.<br />
-
 
+
 
</p>
 
</p>
 
<h3>  II. Design on avoiding generating extra files after chunk process.</h3>
 
<h3>  II. Design on avoiding generating extra files after chunk process.</h3>
 
<p>
 
<p>
-
  <br />
+
<br />
-
  Based on these issues, a new list file &quot;canditopics.list&quot; is needed for recording all file URLs that appear as an &quot;href&quot; attribute on any element. A sample fragment of this file may be:<br />
+
Based on these issues, a new list file &quot;canditopics.list&quot; is needed for recording all file URLs that appear as an &quot;href&quot; attribute on any element. A sample fragment of this file may be:<br />
 
</p>
 
</p>
-
<pre>
+
foo.dita#foobar  <br />
-
  +---------------------------------+
+
bar.dita#topicID <br />
-
</pre>
+
foobar.dita      <br />
-
<pre>
+
-
  | foo.dita#foobar                 |
+
-
</pre>
+
-
<pre>
+
-
  | bar.dita#topicID                |
+
-
</pre>
+
-
<pre>
+
-
  | foobar.dita                     |
+
-
</pre>
+
-
<pre>
+
-
  | ...                             |
+
-
</pre>
+
-
<pre>
+
-
  +---------------------------------+
+
-
</pre>
+
 
<p>
 
<p>
-
  <br />
+
<br />
-
  This list is used for denoting all topics that will probably generate output in the final results. Of all these candidate topics, some may be excluded from the final result because they are to be processed in the later chunk step. To strip off these topics, another list &quot;skipchunk.list&quot; is also needed. &quot;skipchunk.list&quot; will denote all &quot;href&quot; attributes that appear together with a &quot;chunk&quot; attribute. Its contents are similar to &quot;canditopics.list&quot;, and a sample fragment would be like:<br />
+
This list is used for denoting all topics that will probably generate output in the final results. Of all these candidate topics, some may be excluded from the final result because they are to be processed in the later chunk step. To strip off these topics, another list &quot;skipchunk.list&quot; is also needed. &quot;skipchunk.list&quot; will denote all &quot;href&quot; attributes that appear together with a &quot;chunk&quot; attribute. Its contents are similar to &quot;canditopics.list&quot;, and a sample fragment would be like:<br />
-
 
+
 
</p>
 
</p>
-
<pre>
+
foo.dita#bar<br />
-
  +---------------------------------+
+
bar.dita#foo<br />
-
</pre>
+
foobar.dita <br />
-
<pre>
+
...         <br />
-
  | foo.dita#bar                    |
+
<br />
-
</pre>
+
-
<pre>
+
-
  | bar.dita#foo                    |
+
-
</pre>
+
-
<pre>
+
-
  | foobar.dita                     |
+
-
</pre>
+
-
<pre>
+
-
  | ...                             |
+
-
</pre>
+
-
<pre>
+
-
  +---------------------------------+
+
-
</pre>
+
 
<p>
 
<p>
-
  <br />
  
-
  When normal chunk process ends, entries from &quot;skipchunk.list&quot; will be stripped out of &quot;canditopics.list&quot;. All the remaining topics should generate output. That is to strip out entries which are not in &quot;canditopics.list&quot; from dita.list and update other lists as well.<br />
  
-
  <br />
  
-
  To achieve that, we need to modify GenListModuleReader to generate &quot;canditopics.list&quot; and &quot;skipchunk.list&quot; accordingly.<br />
  
 
<br />
 
<br />
-
  In GenListModuleReader:<br />
+
When normal chunk process ends, entries from &quot;skipchunk.list&quot; will be stripped out of &quot;canditopics.list&quot;. All the remaining topics should generate output. That is to strip out entries which are not in &quot;canditopics.list&quot; from dita.list and update other lists as well.<br />
-
    1. Add a new HashSet hrefTopicSet for storing all href attributes in any element in a ditamap file.<br />
+
<br />
-
    2. Add a new HashSet chunkTopicSet for storing all href attributes that appear together with chunk attributes in a ditamap file.<br />
+
To achieve that, we need to modify GenListModuleReader to generate &quot;canditopics.list&quot; and &quot;skipchunk.list&quot; accordingly.<br />
-
    3. Add a method for getting hrefTopicSet.<br />
+
<br />
-
    4. Add a method for getting chunkTopicSet.<br />
+
In GenListModuleReader:<br />
-
    5. When the file being parsed is a ditamap file, add logics to store the &quot;href&quot; attribute into hrefTopicSet in method &quot;startElement&quot;. If an &quot;href&quot; value is not empty, the scope value is not &quot;external&quot; or &quot;peer&quot;, and the target file is valid, add it to hrefTopicSet.<br />
+
1. Add a new HashSet hrefTopicSet for storing all href attributes in any element in a ditamap file.<br />
-
    6. Add logics for adding elements to chunkTopicSet. If a &quot;href&quot; value appears together with a chunk attribute and scope value is not &quot;external&quot; or &quot;peer&quot; and target file is valid, add it to chunkTopicSet.<br />
+
2. Add a new HashSet chunkTopicSet for storing all href attributes that appear together with chunk attributes in a ditamap file.<br />
-
    <br />
+
3. Add a method for getting hrefTopicSet.<br />
-
  In GenMapAndTopicListModule:<br />
+
4. Add a method for getting chunkTopicSet.<br />
-
    1. Add a new HashSet hrefTopicSet for storing all href attributes in any element in a ditamap file.<br />
+
5. When the file being parsed is a ditamap file, add logics to store the &quot;href&quot; attribute into hrefTopicSet in method &quot;startElement&quot;. If an &quot;href&quot; value is not empty, the scope value is not &quot;external&quot; or &quot;peer&quot;, and the target file is valid, add it to hrefTopicSet.<br />
-
    2. Add a new HashSet chunkTopicSet for storing all href attributes that appear together with chunk attributes in a ditamap file.<br />
+
6. Add logics for adding elements to chunkTopicSet. If a &quot;href&quot; value appears together with a chunk attribute and scope value is not &quot;external&quot; or &quot;peer&quot; and target file is valid, add it to chunkTopicSet.<br />
-
    3. Add logics for retrieving &quot;href topics&quot; from parsed ditamap files into hrefTopicSet in method &quot;processParseResult(String)&quot;.<br />
+
<br />
-
    4. Add logics for retrieving &quot;chunk topics&quot; from parsed ditamap files into chunkTopicSet in method &quot;processParseResult(String)&quot;.<br />
+
In GenMapAndTopicListModule:<br />
-
    5. Add logics for preparing hrefTopicSet for writing out in method &quot;outputResult&quot;.<br />
+
1. Add a new HashSet hrefTopicSet for storing all href attributes in any element in a ditamap file.<br />
-
    6. Add logics for preparing chunkTopicSet for writing out in method &quot;outputResult&quot;.<br />
+
2. Add a new HashSet chunkTopicSet for storing all href attributes that appear together with chunk attributes in a ditamap file.<br />
-
    <br />
+
3. Add logics for retrieving &quot;href topics&quot; from parsed ditamap files into hrefTopicSet in method &quot;processParseResult(String)&quot;.<br />
-
  In Constants:<br />
+
4. Add logics for retrieving &quot;chunk topics&quot; from parsed ditamap files into chunkTopicSet in method &quot;processParseResult(String)&quot;.<br />
-
    1. Add a new string &quot;canditopics&quot;.<br />
+
5. Add logics for preparing hrefTopicSet for writing out in method &quot;outputResult&quot;.<br />
-
    2. Add a new string &quot;skipchunk&quot;.<br />
+
6. Add logics for preparing chunkTopicSet for writing out in method &quot;outputResult&quot;.<br />
-
    <br />
+
<br />
-
    Note: We already have logics that generate href targets, but they only produce file names (such as foo.xml) with the anchor fragment discarded, and the fragment is exactly what we need. Thus we choose to generate a new file in order not to affect other parts of OT.<br />
+
In Constants:<br />
-
    <br />
+
1. Add a new string &quot;canditopics&quot;.<br />
-
  Also we need to modify logics in ChunkModule::updateList() method, right before updating dita.list, we need to refine the original topic list with &quot;canditopics.list&quot; and &quot;skipchunk.list&quot;:<br />
+
2. Add a new string &quot;skipchunk&quot;.<br />
-
    1. Load canditopics.list into hrefTopicSet.<br />
+
<br />
-
    2. Load skipchunk.list into chunkTopicSet.<br />
+
Note: We already have logics that generate href targets, but they only produce file names (such as foo.xml) with the anchor fragment discarded, and the fragment is exactly what we need. Thus we choose to generate a new file in order not to affect other parts of OT.<br />
-
    3. Remove all entries existing in both sets from hrefTopicSet.<br />
+
<br />
-
    4. For each remaining entry in hrefTopicSet, cut off its anchor fragment. (Now these files are ready to generate output).<br />
+
Also we need to modify logics in ChunkModule::updateList() method, right before updating dita.list, we need to refine the original topic list with &quot;canditopics.list&quot; and &quot;skipchunk.list&quot;:<br />
-
    5. Write hrefTopicSet to dita.list as &quot;FULL_DITA_TOPIC_LIST&quot; and &quot;FULL_DITAMAP_LIST&quot;, replace the old values.<br />
+
1. Load canditopics.list into hrefTopicSet.<br />
-
    6. Update the list with &quot;changeTable&quot; which may contain newly generated chunk file.<br />
+
2. Load skipchunk.list into chunkTopicSet.<br />
-
    7. Remove any unwanted topic file from the temp dir. (Those are not in the updated list).<br />
+
3. Remove all entries existing in both sets from hrefTopicSet.<br />
-
    <br />
+
4. For each remaining entry in hrefTopicSet, cut off its anchor fragment. (Now these files are ready to generate output).<br />
-
    NOTE: &quot;changeTable&quot; will be explained later.<br />
+
5. Write hrefTopicSet to dita.list as &quot;FULL_DITA_TOPIC_LIST&quot; and &quot;FULL_DITAMAP_LIST&quot;, replace the old values.<br />
-
 
+
6. Update the list with &quot;changeTable&quot; which may contain newly generated chunk file.<br />
  +
7. Remove any unwanted topic file from the temp dir. (Those are not in the updated list).<br />
  +
<br />
  +
NOTE: &quot;changeTable&quot; will be explained later.<br />
 
</p>
 
</p>
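Steps 1 to 4 of the ChunkModule::updateList() refinement above can be sketched in Java as follows. This is only an illustration under assumptions: the class and method names (TopicListRefiner, refine, stripFragment) are hypothetical and not part of the toolkit, and the two sets are assumed to be already loaded from "canditopics.list" and "skipchunk.list".

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper; the name and signatures are assumptions, not toolkit code. */
final class TopicListRefiner {

    private TopicListRefiner() {}

    /** Drops the "#topicId" anchor fragment, if any, from an href. */
    static String stripFragment(String href) {
        int hash = href.indexOf('#');
        return hash < 0 ? href : href.substring(0, hash);
    }

    /**
     * Returns the files that still have to generate output: every candidate
     * href that is not listed for chunking, reduced to its file part.
     */
    static Set<String> refine(Set<String> hrefTopicSet, Set<String> chunkTopicSet) {
        // Step 3: remove all entries that exist in both sets.
        Set<String> remaining = new LinkedHashSet<>(hrefTopicSet);
        remaining.removeAll(chunkTopicSet);

        // Step 4: cut off the anchor fragment of each remaining entry.
        Set<String> files = new LinkedHashSet<>();
        for (String href : remaining) {
            files.add(stripFragment(href));
        }
        return files;
    }
}
```

Fed with the hrefs of Case 2, the sketch keeps A.dita and B.dita: B.dita#T1 is stripped because it is chunked, but B.dita#T2 keeps the file in the list.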
 
<h3>  III. File name issues in chunk process.</h3>
 
<h3>  III. File name issues in chunk process.</h3>
 
<p>
 
<p>
-
  <br />
+
<br />
-
  According to the chunk specification, when creating new documents via chunk processing, the storage object name or identifier (if relevant) is taken from the copyto attribute if set; otherwise the root name is taken from the id attribute if the by-topic policy is in effect, and from the name of the referenced document if the by-document policy is in effect. Unfortunately, it does not say what to do when a name conflict occurs. Therefore we shall follow these naming steps:<br />
+
According to the chunk specification, when creating new documents via chunk processing, the storage object name or identifier (if relevant) is taken from the copyto attribute if set; otherwise the root name is taken from the id attribute if the by-topic policy is in effect, and from the name of the referenced document if the by-document policy is in effect. Unfortunately, it does not say what to do when a name conflict occurs. Therefore we shall follow these naming steps:<br />
-
    1. Try to use a specified name, (such as copyto value), if any.<br />
+
1. Try to use a specified name, (such as copyto value), if any.<br />
-
    2. If chunk=&quot;to-content&quot; is specified in map element, then chunk result should first try to use the ditamap's file name.<br />
+
2. If chunk=&quot;to-content&quot; is specified in map element, then chunk result should first try to use the ditamap's file name.<br />
-
    3. If &quot;by-topic&quot; is specified in chunk attribute, then chunk result should first try to use topic id as file name.<br />
+
3. If &quot;by-topic&quot; is specified in chunk attribute, then chunk result should first try to use topic id as file name.<br />
-
    4. Otherwise use name of the referenced file which is to be chunked.<br />
+
4. Otherwise use name of the referenced file which is to be chunked.<br />
-
    5. If there were name conflicts:<br />
+
5. If there were name conflicts:<br />
-
         a) Mark this conflict.<br />
+
a) Mark this conflict.<br />
-
         b) Generate a random file name and associate it with the original file name.<br />
+
b) Generate a random file name and associate it with the original file name.<br />
-
         c) Use the random name to complete chunk process.<br />
+
c) Use the random name to complete chunk process.<br />
-
         d) When we have removed unwanted extra files, for each name conflict, if its original file name still exists, it means this conflict cannot be resolved so the random name is used as new file name. Otherwise, rename the randomly named file to its original name.<br />
+
d) When we have removed unwanted extra files, for each name conflict, if its original file name still exists, it means this conflict cannot be resolved so the random name is used as new file name. Otherwise, rename the randomly named file to its original name.<br />
-
    <br />
+
<br />
-
    In order to track name conflicts, the original &quot;changeTable&quot; structure needs to be modified to meet our needs.<br />
+
In order to track name conflicts, the original &quot;changeTable&quot; structure needs to be modified to meet our needs.<br />
-
    
+
 
</p>
 
</p>
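Naming steps 1 to 4 above can be sketched in Java as follows. This is an illustration only: ChunkNameSketch, chooseName and withDitaExtension are hypothetical names, and the ".dita" extension for generated chunk files is an assumption inferred from Case 1, where the chunk named after map.ditamap collides with map.dita. Conflict handling (step 5) is deliberately left out.

```java
/** Hypothetical helper illustrating naming steps 1 to 4; not toolkit code. */
final class ChunkNameSketch {

    private ChunkNameSketch() {}

    /** Replaces the extension of fileName with ".dita" (assumed chunk extension). */
    static String withDitaExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = dot < 0 ? fileName : fileName.substring(0, dot);
        return base + ".dita";
    }

    /** Picks the preferred chunk file name before any conflict check. */
    static String chooseName(String copyTo, boolean mapToContent, String mapFileName,
                             boolean byTopic, String topicId, String referencedFile) {
        if (copyTo != null && !copyTo.isEmpty()) {
            return copyTo;                          // step 1: explicitly specified name
        }
        if (mapToContent) {
            return withDitaExtension(mapFileName);  // step 2: name of the ditamap
        }
        if (byTopic && topicId != null && !topicId.isEmpty()) {
            return topicId + ".dita";               // step 3: topic id
        }
        return referencedFile;                      // step 4: referenced file name
    }
}
```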
-
<pre>
+
    +--------------<br />
-
    +---------------+
+
    | Change Table<br />
-
</pre>
+
    +--------------<br />
-
<pre>
+
            |<br />
-
    | Change Table  |
+
            |       +------------------------------------+      +-------------------------+<br />
-
</pre>
+
            +----+ Original File URL (String) +----&gt;| Target File URL (String)<br />
-
<pre>
+
                    +------------------------------------+      +-------------------------+<br />
-
    +---------------+
+
                                                                            | Random File URL (String)|<br />
-
</pre>
+
                                                                           +-------------------------+<br />
-
<pre>
+
<br />
-
            |
+
                                       Key                                     Value<br />
-
</pre>
+
-
<pre>
+
-
            |       +----------------------------+     +-------------------------+
+
-
</pre>
+
-
<pre>
+
-
            +-------+ Original File URL (String) +----&gt;| Target File URL (String)|
+
-
</pre>
+
-
<pre>
+
-
                    +----------------------------+     +-------------------------+
+
-
</pre>
+
-
<pre>
+
-
                                                       | Random File URL (String)|
+
-
</pre>
+
-
<pre>
+
-
                                                       +-------------------------+
+
-
</pre>
+
-
<pre>
+
-
                                                 
+
-
</pre>
+
-
<pre>
+
-
                                  Key                              Value
+
-
</pre>
+
 
<p>
 
<p>
-
    <br />
+
<br />
-
    A name conflict can be denoted as an association between the original URL and the target URL. The original URL is the file URL as it appears in an &quot;href&quot; attribute. If a topic is processed into a separate chunk file, all references to the topic should be updated to point to the new target URL in the chunk file. If the target file already exists, a random file name is used and associated with the original one, because we don't want to lose that information, which is needed for resolving the conflict later.<br />
+
A name conflict can be denoted as an association between the original URL and the target URL. The original URL is the file URL as it appears in an &quot;href&quot; attribute. If a topic is processed into a separate chunk file, all references to the topic should be updated to point to the new target URL in the chunk file. If the target file already exists, a random file name is used and associated with the original one, because we don't want to lose that information, which is needed for resolving the conflict later.<br />
-
    <br />
+
<br />
-
    In addition, we need to update all references in all topics according to &quot;changeTable&quot; because topics may be moved between files and references should follow the chunk results.<br />
+
In addition, we need to update all references in all topics according to &quot;changeTable&quot; because topics may be moved between files and references should follow the chunk results.<br />
-
    <br />
+
<br />
-
    To achieve that we need to:<br />
+
To achieve that we need to:<br />
-
      1. Modify logics in method &quot;read&quot; of ChunkMapReader:
+
1. Modify logics in method &quot;read&quot; of ChunkMapReader:
 
</p>
 
</p>
-
<pre>
+
<pre class="low">
 
         if &quot;to-content&quot; is specified in map chunk
 
         if &quot;to-content&quot; is specified in map chunk
 
</pre>
 
</pre>
-
<pre>
+
<pre class="low">
 
           Use map name as new chunk file name
 
           Use map name as new chunk file name
 
</pre>
 
</pre>
-
<pre>
+
<pre class="low">
 
           if an identical file name already exists
 
           if an identical file name already exists
 
</pre>
 
</pre>
-
<pre>
+
<pre class="low">
 
             Use a randomly generated name as chunk file name
 
             Use a randomly generated name as chunk file name
 
</pre>
 
</pre>
-
<pre>
+
<pre class="low">
 
             Associate target name and random with original name
 
             Associate target name and random with original name
 
</pre>
 
</pre>
-
<pre>
+
<pre class="low">
 
           end if
 
           end if
 
</pre>
 
</pre>
-
<pre>
+
<pre class="low">
 
         end if
 
         end if
 
</pre>
 
</pre>
-
<pre>
+
<pre class="low">
 
         Do normal chunk process
 
         Do normal chunk process
-
</pre>
  
-
<pre>
  
-
         ...
  
 
</pre>
 
</pre>
 
<p>
 
<p>
-
      2. Modify logics in method &quot;startElement&quot; of ChunkTopicParser:
+
if &quot;to-content&quot; is specified in map chunk<br />
  +
           Use map name as new chunk file name<br />
  +
           if an identical file name already exists<br />
  +
             Use a randomly generated name as chunk file name<br />
  +
             Associate target name and random with original name<br />
  +
           end if<br />
  +
         end if<br />
  +
         Do normal chunk process<br />
  +
         ...
 
</p>
 
</p>
-
<pre>
  
-
         if &quot;by-topic&quot; is specified in chunk attribute
  
-
</pre>
  
-
<pre>
  
-
           Use current topic id as new chunk file name
  
-
</pre>
  
-
<pre>
  
-
           if an identical file name already exists
  
-
</pre>
  
-
<pre>
  
-
             Use a randomly generated name as chunk file name
  
-
</pre>
  
-
<pre>
  
-
             Associate target name and random with original name
  
-
</pre>
  
-
<pre>
  
-
           end if
  
-
</pre>
  
-
<pre>
  
-
         end if
  
-
</pre>
  
-
<pre>
  
-
         Do normal chunk process
  
-
</pre>
  
-
<pre>
  
-
         ...
  
-
</pre>
  
 
<p>
 
<p>
-
      3. Add a method &quot;resolveNamingConflicts&quot; in ChunkModule, which will be called right after &quot;updateList&quot; to resolve name conflicts. The method completes the following tasks:
+
2. Modify logics in method &quot;startElement&quot; of ChunkTopicParser:
 
</p>
 
</p>
-
<pre>
  
-
           for each entry in changeTable
  
-
</pre>
  
-
<pre>
  
-
           do
  
-
</pre>
  
-
<pre>
  
-
             if randomURL != null &amp;&amp; originalURL.exists()
  
-
</pre>
  
-
<pre>
  
-
               targetURL = randomURL
  
-
</pre>
  
-
<pre>
  
-
             else if !originalURL.exists()
  
-
</pre>
  
-
<pre>
  
-
               remove originalURL
  
-
</pre>
  
-
<pre>
  
-
               rename randomURL to targetURL
  
-
</pre>
  
-
<pre>
  
-
             end if
  
-
</pre>
  
-
<pre>
  
-
             randomURL = null
  
-
</pre>
  
-
<pre>
  
-
           done
  
-
</pre>
  
 
<p>
 
<p>
-
       4. Add a method &quot;updateTopicRefs&quot; in ChunkModule, which will be called right after &quot;resolveNamingConflicts&quot; to update topic references. It completes the following tasks:
+
if &quot;by-topic&quot; is specified in chunk attribute<br />
  +
           Use current topic id as new chunk file name<br />
  +
           if an identical file name already exists<br />
  +
             Use a randomly generated name as chunk file name<br />
  +
             Associate target name and random with original name<br />
  +
           end if<br />
  +
         end if<br />
  +
         Do normal chunk process<br />
  +
         ...
 
</p>
 
</p>
-
<pre>
+
<p>
-
            for each topic file in the updated list of topics
+
3. Add a method &quot;resolveNamingConflicts&quot; in ChunkModule, which will be called right after &quot;updateList&quot; to resolve name conflicts. The method completes the following tasks:
-
</pre>
+
</p>
-
<pre>
+
<p>
-
            do
+
for each entry in changeTable<br />
-
</pre>
+
           do<br />
-
<pre>
+
             if randomURL != null &amp;&amp; originalURL.exists()<br />
-
              parse the topic
+
               targetURL = randomURL<br />
-
</pre>
+
             else if !originalURL.exists()<br />
-
<pre>
+
               remove originalURL<br />
-
              if current element contains &quot;href&quot; attribute
+
               rename randomURL to targetURL<br />
-
</pre>
+
             end if<br />
-
<pre>
+
             randomURL = null<br />
-
                if &quot;href&quot; value is a key in &quot;changeTable&quot;
+
           done
-
</pre>
+
</p>
-
<pre>
+
<p>
-
                  update &quot;href&quot; value with changeTable.getTargetValue(&quot;href&quot; value)
+
4. Add a method &quot;updateTopicRefs&quot; in ChunkModule, which will be called right after &quot;resolveNamingConflicts&quot; to update topic references. It completes the following tasks:
-
</pre>
+
</p>
-
<pre>
+
for each topic file in the updated list of topics<br />
-
                end if
+
            do<br />
-
</pre>
+
              parse the topic<br />
-
<pre>
+
              if current element contains &quot;href&quot; attribute<br />
-
              end if
+
                if &quot;href&quot; value is a key in &quot;changeTable&quot;<br />
-
</pre>
+
                  update &quot;href&quot; value with changeTable.getTargetValue(&quot;href&quot; value)<br />
-
<pre>
+
                end if<br />
  +
              end if<br />
 
            done
 
            done
-
</pre>
  
 
<p>
 
<p>
-
            <br />
  
-
        NOTE: Parsing all topic files a second time is costly, but I think it is unavoidable. The chunk process needs all files to be present throughout. Some of them may disappear or be renamed once it finishes, and whether a given file remains unchanged cannot be determined until the whole chunk process has finished. Therefore all references have to be updated in a separate, fresh parse.<br />
  
 
<br />
 
<br />
-
  IV. Summary.<br />
+
NOTE: Parsing all topic files a second time is costly, but I think it is unavoidable. The chunk process needs all files to be present throughout. Some of them may disappear or be renamed once it finishes, and whether a given file remains unchanged cannot be determined until the whole chunk process has finished. Therefore all references have to be updated in a separate, fresh parse.<br />
-
  <br />
+
</p>
-
  To summarize, the suggested measures for solving the chunk problems are:<br />
+
<h3>
-
    1. Generate &quot;canditopics.list&quot; and &quot;skipchunk.list&quot; in GenMapAndTopicListModule.<br />
+
IV. Summary.</h3>
-
    2. Mark name conflicts in &quot;changeTable&quot; during chunk process.<br />
+
<p>
-
    3. Update dita.list with &quot;canditopics&quot; and &quot;skipchunk&quot; together with &quot;changeTable&quot;.<br />
+
<br />
-
    4. Update all references in topics according to &quot;changeTable&quot; in the end of chunk process.
+
To summarize, the suggested measures for solving the chunk problems are:<br />
  +
1. Generate &quot;canditopics.list&quot; and &quot;skipchunk.list&quot; in GenMapAndTopicListModule.<br />
  +
2. Mark name conflicts in &quot;changeTable&quot; during chunk process.<br />
  +
3. Update dita.list with &quot;canditopics&quot; and &quot;skipchunk&quot; together with &quot;changeTable&quot;.<br />
  +
4. Update all references in topics according to &quot;changeTable&quot; in the end of chunk process.
 
</p>
 
</p>
 
 
Revision of Mon, 2009-01-12 18:51:

Design for chunking enhancements

 

I. Prologue.


Before we start to discuss the detailed implementation design, let's first have a look at two examples. Here we have: map.ditamap, map.dita, A.dita, B.dita. File B.dita has cross references to map.dita and A.dita. File A.dita has a reference to map.dita and a reference to B.dita. Both A and B have multiple topics, in other words, they are both ditabase files.

*) Case 1: map.ditamap

<map chunk="to-content">                             
<topicref chunk="select-topic" href="map.dita"/>   
<topicref chunk="select-topic" href="A.dita">      
<topicref chunk="select-topic" href="map.dita"/>
</topicref>                                        
</map>                                               


It should produce an index.html, a map.html containing all contents from both map.dita and A.dita, and a B.html. References to map.dita in A.dita and B.dita should be updated to point to map.html, with the proper topic ID appended. References to A.dita in B.dita should be updated to point to map.html, with the proper ID.

*) Case 2: map.ditamap:

<map>                                                        
  <topicref chunk="to-content select-topic" href="map.dita">
    <topicref chunk="select-topic" href="B.dita#T1"/>        
  </topicref>                                                
  <topicref href="A.dita#T2"/>                               
  <topicref href="B.dita#T2"/>                               
</map>                                                       

It should produce an index.html, a map.html (containing map.dita, B.dita#T1), a B.html and an A.html. All references to B.dita#T1 in A.dita should be updated to map.html#T1 (or equivalent ID). All references to A.dita in B.dita should remain unchanged.

Case 1 shows that:
1) Name conflicts may exist during chunk process. (map.dita <--> map.ditamap)
2) New contents may be added to an existing file. (map.html actually contains all contents from map.dita and A.dita)
3) References in other topics may need to be updated to correct URL after chunk. (In B.dita, reference to A.dita should be changed to map.html)

Case 2 shows that:
1) A dita file may be partially "chunked" (moved into other files). (B.dita#T1 --> map.html)
2) Concrete topics should be taken into consideration when deciding whether to exclude a file from the final results. (Although B.dita#T1 is processed as a chunk, B.dita still needs to generate output, since B.dita#T2 is also referenced in the same map.)

Thus, the examples above reveal 3 important issues:
1) References in topics may be altered after chunk process to point to correct URL.
2) Topics in dita files should also be taken into consideration when excluding extra files.
3) Files may be partially "chunked" or reserved.

II. Design for avoiding extra generated files after the chunk process.


Based on these issues, a new list file "canditopics.list" is needed for recording all file URLs that appear as an "href" attribute on any element. A sample fragment of this file may be:

foo.dita#foobar  
bar.dita#topicID
foobar.dita     


This list is used for denoting all topics that will probably generate output in the final results. Of all these candidate topics, some may be excluded from the final result because they are to be processed in the later chunk step. To strip off these topics, another list "skipchunk.list" is also needed. "skipchunk.list" will denote all "href" attributes that appear together with a "chunk" attribute. Its contents are similar to "canditopics.list", and a sample fragment would be like:

foo.dita#bar
bar.dita#foo
foobar.dita
...         


When the normal chunk process ends, entries from "skipchunk.list" will be stripped out of "canditopics.list". All the remaining topics should generate output. In other words, entries that are not in the refined "canditopics.list" are stripped from dita.list, and the other lists are updated as well.

To achieve that, we need to modify GenListModuleReader to generate "canditopics.list" and "skipchunk.list" accordingly.

In GenListModuleReader:
1. Add a new HashSet hrefTopicSet for storing all href attributes in any element in a ditamap file.
2. Add a new HashSet chunkTopicSet for storing all href attributes that appear together with chunk attributes in a ditamap file.
3. Add a method for getting hrefTopicSet.
4. Add a method for getting chunkTopicSet.
5. When the file currently being parsed is a ditamap file, add logic in method "startElement" to store the "href" attribute into hrefTopicSet. If an "href" value is not empty, the scope value is not "external" or "peer", and the target file is valid, add it to hrefTopicSet.
6. Add logic for adding elements to chunkTopicSet. If an "href" value appears together with a chunk attribute, the scope value is not "external" or "peer", and the target file is valid, add it to chunkTopicSet.
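
The collection rules in steps 5 and 6 can be sketched as a small SAX handler. This is only an illustration under stated assumptions: the class name HrefCollector and its parse helper are hypothetical, it is not the actual GenListModuleReader code, and the "target file is valid" check is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the additions to GenListModuleReader.
class HrefCollector extends DefaultHandler {
    private final Set<String> hrefTopicSet = new HashSet<>();
    private final Set<String> chunkTopicSet = new HashSet<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String href = atts.getValue("href");
        String scope = atts.getValue("scope");
        // Skip empty hrefs and targets outside the current build.
        if (href == null || href.isEmpty()) {
            return;
        }
        if ("external".equals(scope) || "peer".equals(scope)) {
            return;
        }
        // Assumption: the "target file is valid" check is omitted in this sketch.
        hrefTopicSet.add(href);
        if (atts.getValue("chunk") != null) {
            chunkTopicSet.add(href);
        }
    }

    Set<String> getHrefTopicSet() {
        return hrefTopicSet;
    }

    Set<String> getChunkTopicSet() {
        return chunkTopicSet;
    }

    // Convenience helper (hypothetical): parse a ditamap given as a string.
    static HrefCollector parse(String mapXml) throws Exception {
        HrefCollector collector = new HrefCollector();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(mapXml.getBytes(StandardCharsets.UTF_8)), collector);
        return collector;
    }
}
```

Fed the Case 2 map, such a handler would put all four href values into hrefTopicSet and only map.dita and B.dita#T1 into chunkTopicSet.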

In GenMapAndTopicListModule:
1. Add a new HashSet hrefTopicSet for storing all href attributes in any element in a ditamap file.
2. Add a new HashSet chunkTopicSet for storing all href attributes that appear together with chunk attributes in a ditamap file.
3. Add logic in method "processParseResult(String)" for retrieving "href topics" from parsed ditamap files into hrefTopicSet.
4. Add logic in method "processParseResult(String)" for retrieving "chunk topics" from parsed ditamap files into chunkTopicSet.
5. Add logic in method "outputResult" for preparing hrefTopicSet to be written out.
6. Add logic in method "outputResult" for preparing chunkTopicSet to be written out.

In Constants:
1. Add a new string "canditopics".
2. Add a new string "skipchunk".

Note: We currently do have logic that generates href targets, but it only produces file names (like foo.xml) with the anchor fragment discarded, and the fragment is exactly what we need here. Thus we choose to generate a new file in order not to affect other parts of OT.

We also need to modify the logic in the ChunkModule::updateList() method. Right before updating dita.list, we need to refine the original topic list with "canditopics.list" and "skipchunk.list":
1. Load canditopics.list into hrefTopicSet.
2. Load skipchunk.list into chunkTopicSet.
3. Remove all entries existing in both sets from hrefTopicSet.
4. For each remaining entry in hrefTopicSet, cut off its anchor fragment. (Now these files are ready to generate output.)
5. Write hrefTopicSet to dita.list as "FULL_DITA_TOPIC_LIST" and "FULL_DITAMAP_LIST", replacing the old values.
6. Update the list with "changeTable", which may contain newly generated chunk files.
7. Remove any unwanted topic files (those not in the updated list) from the temp dir.
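
Steps 3 and 4 amount to a set difference followed by stripping the anchor fragment. A minimal sketch, assuming both lists have already been loaded into sets of strings; the class and method names are hypothetical and not part of ChunkModule:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper illustrating steps 3 and 4 of the list refinement.
class TopicListRefiner {
    static Set<String> refine(Set<String> hrefTopicSet, Set<String> chunkTopicSet) {
        Set<String> files = new TreeSet<>();
        for (String href : hrefTopicSet) {
            // Step 3: drop entries that will be handled by the chunk step.
            if (chunkTopicSet.contains(href)) {
                continue;
            }
            // Step 4: cut off the anchor fragment, keeping only the file part.
            int hash = href.indexOf('#');
            String file = hash >= 0 ? href.substring(0, hash) : href;
            if (!file.isEmpty()) {
                files.add(file);
            }
        }
        return files;
    }
}
```

For Case 2, the candidates map.dita, B.dita#T1, A.dita#T2 and B.dita#T2 with skip entries map.dita and B.dita#T1 leave A.dita and B.dita, so B.dita still generates output even though B.dita#T1 was chunked elsewhere.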

NOTE: "changeTable" will be explained later.

III. File name issues in chunk process.


According to the chunk specification, when creating new documents via chunk processing, the storage object name or identifier (if relevant) is taken from the copyto attribute if set; otherwise the root name is taken from the id attribute if the by-topic policy is in effect, and from the name of the referenced document if the by-document policy is in effect. Unfortunately, it does not say what to do when a name conflict occurs. Therefore we shall follow these naming steps:
1. Try to use a specified name, (such as copyto value), if any.
2. If chunk="to-content" is specified in map element, then chunk result should first try to use the ditamap's file name.
3. If "by-topic" is specified in chunk attribute, then chunk result should first try to use topic id as file name.
4. Otherwise use name of the referenced file which is to be chunked.
5. If there is a name conflict:
a) Mark this conflict.
b) Generate a random file name and associate it with the original file name.
c) Use the random name to complete the chunk process.
d) After the unwanted extra files have been removed, check each name conflict: if a file with its original file name still exists, the conflict cannot be resolved, so the random name is used as the new file name. Otherwise, rename the randomly named file to its original name.
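
Steps 1 to 4 form a simple priority chain, which could look like the following. This is a sketch only: ChunkNaming and its parameter names are hypothetical, and file extensions are assumed to be handled by the caller.

```java
// Hypothetical helper showing the naming priority from steps 1 to 4.
class ChunkNaming {
    static String preferredName(String copyTo, boolean mapToContent, String mapName,
                                boolean byTopic, String topicId, String referencedName) {
        // Step 1: an explicitly specified name (such as copyto) wins.
        if (copyTo != null && !copyTo.isEmpty()) {
            return copyTo;
        }
        // Step 2: chunk="to-content" on the map element uses the ditamap's name.
        if (mapToContent) {
            return mapName;
        }
        // Step 3: "by-topic" uses the topic id.
        if (byTopic && topicId != null && !topicId.isEmpty()) {
            return topicId;
        }
        // Step 4: fall back to the name of the referenced file.
        return referencedName;
    }
}
```

The conflict handling of step 5 is deliberately left out of this function, since it depends on which files exist when the name is chosen.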

In order to track name conflicts, the original "changeTable" structure needs to be modified to meet our needs.

    +--------------+
    | Change Table |
    +--------------+
           |
           |    +----------------------------+      +--------------------------+
           +--->| Original File URL (String) |----->| Target File URL (String) |
                +----------------------------+      +--------------------------+
                                                    | Random File URL (String) |
                                                    +--------------------------+

                             Key                               Value


A name conflict can be denoted as an association between the original URL and the target URL. The original URL is the file URL as it appears in an "href" attribute. If a topic is processed into a separate chunk file, all references to the topic should be updated to point to the new target URL in the chunk file. If the target file already exists, a random file name will be used and associated with the original one, because we don't want to lose the information that is needed for resolving the conflict later.
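
One possible shape for the modified table is a map from the original URL to a (target URL, random URL) pair. The sketch below is an assumption for illustration: the class ChangeTable, the Entry type, the set of existing files and the UUID-based random name are all hypothetical choices, not the existing OT structure.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch of a changeTable that also tracks name conflicts.
class ChangeTable {
    static final class Entry {
        final String targetUrl;
        final String randomUrl; // null when there is no name conflict

        Entry(String targetUrl, String randomUrl) {
            this.targetUrl = targetUrl;
            this.randomUrl = randomUrl;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Records originalUrl -> targetUrl; if the target already exists,
    // a random name is generated and kept alongside the target.
    Entry record(String originalUrl, String targetUrl, Set<String> existingFiles) {
        String randomUrl = null;
        if (existingFiles.contains(targetUrl)) {
            randomUrl = "chunk-" + UUID.randomUUID() + ".dita";
        }
        Entry entry = new Entry(targetUrl, randomUrl);
        entries.put(originalUrl, entry);
        return entry;
    }

    Entry get(String originalUrl) {
        return entries.get(originalUrl);
    }
}
```

Keeping both the target and the random name in the value means the conflict can still be resolved later, once it is known whether the file holding the target name survives.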

In addition, we need to update all references in all topics according to "changeTable" because topics may be moved between files and references should follow the chunk results.

To achieve that we need to:
1. Modify the logic in method "read" of ChunkMapReader:

    if "to-content" is specified in map chunk
      Use map name as new chunk file name
      if an identical file name already exists
        Use a randomly generated name as chunk file name
        Associate target name and random with original name
      end if
    end if
    Do normal chunk process
    ...

2. Modify the logic in method "startElement" of ChunkTopicParser:

    if "by-topic" is specified in chunk attribute
      Use current topic id as new chunk file name
      if an identical file name already exists
        Use a randomly generated name as chunk file name
        Associate target name and random with original name
      end if
    end if
    Do normal chunk process
    ...

3. Add a method "resolveNamingConflicts" in ChunkModule, which will be called right after "updateList" to resolve name conflicts. The method completes the following tasks:

    for each entry in changeTable
    do
      if randomURL != null
        if originalURL.exists()
          targetURL = randomURL
        else
          rename randomURL to targetURL
        end if
        randomURL = null
      end if
    done
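
The resolution rule from naming step 5 d) could be written against the file system roughly as follows. This is a hedged sketch, not the ChunkModule method itself: ConflictResolver, its parameters, and the use of java.nio.file are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper following naming step 5 d).
class ConflictResolver {
    // Returns the file name the chunk ends up with.
    static String resolve(Path dir, String originalName, String randomName) throws IOException {
        if (randomName == null) {
            // No conflict was recorded for this entry.
            return originalName;
        }
        if (Files.exists(dir.resolve(originalName))) {
            // The original name is still taken: keep the random name.
            return randomName;
        }
        // The original name is free again: move the chunk back to it.
        Files.move(dir.resolve(randomName), dir.resolve(originalName));
        return originalName;
    }
}
```

The check has to run after the unwanted files have been removed, because only then is it known whether the original name is free.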

4. Add a method "updateTopicRefs" in ChunkModule, which will be called right after "resolveNamingConflicts" to update topic references. It completes the following tasks:

    for each topic file in the updated list of topics
    do
      parse the topic
      for each element in the topic
      do
        if current element contains "href" attribute
          if "href" value is a key in "changeTable"
            update "href" value with changeTable.getTargetValue("href" value)
          end if
        end if
      done
    done
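
The per-topic update can be sketched with the standard DOM API. The class TopicRefUpdater is hypothetical, the change table is simplified to a plain map from original href to final target, and the sample input is assumed to have no DOCTYPE so that no DTD has to be loaded.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of the href rewrite done for each topic file.
class TopicRefUpdater {
    static Document update(String topicXml, Map<String, String> changeTable) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(topicXml.getBytes(StandardCharsets.UTF_8)));
        // "*" matches every element in the document.
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (element.hasAttribute("href")) {
                String target = changeTable.get(element.getAttribute("href"));
                if (target != null) {
                    // The href is a key in the change table: point it at the new target.
                    element.setAttribute("href", target);
                }
            }
        }
        return doc;
    }
}
```

Hrefs that are not keys in the table are left untouched, which matches Case 2, where references to A.dita in B.dita remain unchanged.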


NOTE: Parsing all topic files again looks expensive, but I think it is unavoidable. The chunk process needs all files to be present throughout. Some of them may disappear or be renamed after the chunk process finishes, and whether a file will remain unchanged cannot be determined until all chunk processing is complete. Therefore all references must be updated in a separate, fresh parse.

IV. Summarization.


To summarize, the suggested measures for solving the chunk problems are:
1. Generate "canditopics.list" and "skipchunk.list" in GenMapAndTopicListModule.
2. Mark name conflicts in "changeTable" during the chunk process.
3. Update dita.list with "canditopics" and "skipchunk" together with "changeTable".
4. Update all references in topics according to "changeTable" at the end of the chunk process.
