Hi Chris and all others, It took me quite a while to think about all what Chris wrote in his mail. I was still unsatisfied by discussing user interface issues before I had not understand the real sorting problem. For this reason I started *not* to think about the user interface at the first place and think about the theoretical foundations of our problem instead. The first questions I came across were "What does mark-up actually mean in our problem domain?", "What are possible mark-ups?" and "How do they influence the sorting process?". I'll give a rather complex, though real-live example of what I'm thinking of. We have to process an author index containing names of authors from all over the world. There may appear Chinese authors written in their native language as well as in the Latin-like pronounciation speech (don't know how to say in English). Imagine the Latin-like written names appear between the other authors and the Chinese written symbols appear at the end of the index in a separate list of Chinese authors (which may only of interest for all people capable of reading Chinese). The index processor must understand that an authors name is Chinese, a natural application of a mark-up named (lang=chinese). Chinese names are sometimes sorted based on the number of strokes that appear in the corresponding Chinese symbol. The number of strokes is then an additional structural mark-up. A Chinese name in our example index may then be represented as (lang=chinese (strokes=15 "Liang")) This notation I use here represents mark-up as a named node in a tree with arbitrary many children. I'll come back to this issue later. Another problem that arises in this context are for example, Spanish authors. As we know from the discussion papers, the collating rules for Spanish are different than for other languages (ie. "Ch", "Ll"). Hence, structural mark-up may also influence the collation rules of a keyword. This is a completely new aspect in our discussion since we have not yet considered mark-up influencing collation rules. In fact this is a problem that has to be solved in the parsing phase of the keyword in my opinion. I'd say that a keyword is a tree-based recursive data structure defined as follows: A keyword is (a) a mark-up entity consisting of an identifier containing a sequence of keywords as its childs, or (b) a word, i.e. a sequence of letters. A letter is something that has been obtained by collation rules mapping a sequence of characters into a letter. A possible keyword is (denoted in a list-like style): (no-mark-up "This" (emph "is an") "example" (bold "keyword") ) Here we could declare that "example" and (no-mark-up "example") are essentially the same. Coming back to out initial example we must sort the Chinese authors to the very end of the index. Thus, the first sort phase (I'll stay with this concept since it is intuitively clear from the users' perspective) separates Chinese from non-Chinese authors. Here we have an example for which there is a need to start sorting based on mark-up and not on letters. In a further sorting phase the Chinese authors must be sorted according to the number of strokes of their symbols. In fact this sorting is meaningless for non-Chinese keywords. Thus, the set of meaningful sorting rules may depend on the mark-up as well, introducing another new aspect. I think, that these introductory examples give some idea about what structural mark-up can be used for. It is essentially some kind of meta-information that can be used to sort keywords. If we agree on the view that a keyword can be represented as a tree with letters at its leaves we see several similarities with other application domains. Imagine the possible mark-up schemes are not arbitrarily (lang=chinese may also appear within a strokes=15 element), we are directly lead to SGML and its DTDs and instances. My question is, if there is a grammar on how the mark-up must be used in a keyword, in what sense does a keyword differ from an instance of a SGLM DTD? In other terms, what is the difference between a keyword and a structured document? Currently, I don't see any difference. The question of sorting a keyword is essentially the same question as asking which one of two SGML instances is sorted before the other. I do not address the problem of character encodings or other SGML peculiarities, my intent is only to ask the question "How are structured documents sorted?". And this question seems to be the essence of all our problems. The problem has turned now into the question, how do we sort trees, according to the above outlined notion of tree. From the intuitive point of view, sorting is a process done in several phases. We compare according to some criteria and then obtain some elements that are equal in this sense. We continue to sort according to a new criteria until we finally obtain a total order on all elements. If we want to sort lexicographically according to only *one* sequence of something (i.e. numbers) we need to map a tree (in our sense) into a sequence. How can this be accomplished? A first observation shows that the sorting criteria for each sorting phase is something that can be obtained by extracting and filtering certain information from the tree and serializing this information. For example we can introduce some kind of tree manipulation operations that remove nodes from the tree and exchange subtrees. Exchanging subtrees could then be used to reverse sequences (needed for French sorting rules) and so on. Thus a tree-rewriting system can be used to transform a tree into a sequence (i.e a root-node with a sequence of atomic values). My notion of atomic value is not restricted to letters only--a node containing the name of a mark-up node suffices as well (I start to write more and more informal, since this is something I'm not yet sure about it). The idea is to map a tree to a sequence. And to map several successive sorting phases to several sequences that when concatenated (with some delimiting element) would obtain a total order on the trees. The current approach implemented in xindy is essentially that of a string-rewriting system, that to some extend can be used to rewrite a string, but which does not deal with structural information in a broader sense. Additionally it does not deal with an object-oriented view on letters as attributed objects (cf. define-letter from previous mails). What we need is a generalization of this principle in some form, I currently have no idea of. In fact, the attributes invented in the define-letter declaration can all be represented as mark-up nodes in the tree as well. There is no difference between these two views in my opinion. In fact, to me it seems to be hard to find semi-solutions to this problem that address only some of the problems but not all of them. Another aspect is that the sorting process may be structured differently. It can be described in terms of an acyclic graph having as is edges the specfication of the sorting process that has to be applied. lang=chinese o-------------->o-----+----->o strokes=1 ! +----->o strokes=2 ! lang=others +... +-------------->o .... This is an idea that came into my mind just when I was typing. I'm not yet sure about this aspect. But it may be a natural way of defining sorting processes and reusing paths in the graph, which seems to be quite useful in practice. Here another problem occurs. Are categories or enumerations of attributes still useful as it was introduced in the define-letter declaration? I have not directly answered Chris' mail but I think that this mail addresses almost all of the problems described yet. My intent is to first open discussion for this view on sorting processes before we continue to think about possible user interfaces. From my personal point of view I have now better understood what the real problem and its complexity is (I hope so:). We find ourself now quite far away from where we initially started. Thanks for your patience reading this :) Bye Roger P.S.: I'd like to thank Gabor, whom I tried to convince that keywords are structured documents and who did helpful comments on my ideas. -- ====================================================================== Roger Kehr kehr@iti.informatik.th-darmstadt.de Computer Science Department Technical University of Darmstadt