This takes up something that Roger (I think) introduced some time ago.

chris

======================================================================

These ideas and questions have their origins in a discussion (non-net) I had with Roger and Joachim Schrod back in March.
They explained to me that xindy needs to deal with mark-up within index entries that has the following properties:

-- it affects the order in which entries should appear: eg in indexing a book about a programming language, it is likely that the same word appears both as a reserved word and in its normal-language use; thus it is necessary for a program such as xindy to be able to order such entries correctly;

-- arbitrarily complex mark-up can appear (eg several levels and many types at each level).

Now this is an area where there appear to be no standards and where little research has been done, either into how current indexes and indexers (here I mean people, not software :-) handle such cases or into the theoretical possibilities and good abstract models. So here are some ideas and suggestions to start this research.

Since these are somewhat theoretical (note the total lack of examples), it would be particularly useful to hear from anyone who has produced indexes which needed to distinguish different types of entry or mark-up in this way.

I am thinking of "mark-up" as being "logical mark-up", whether or not it is linked in the formatting of the document to distinct typographical treatment.

I have chosen to analyse this problem as follows:

1. Reducing multi-level mark-up to single-level (flat) mark-up (but not necessarily to just one type of mark-up for the whole entry).

I think that it is essential that "no-mark-up" is itself a type of mark-up. However, this leads to some problems in interpreting multi-level mark-up: eg, if there is no explicit mark-up of some part of the text then the level of the mark-up present can be given the wrong value (I can give examples of this if people would like to see some). I see two ways around this:

-- have a "no-mark-up" tag that must be used in cases where the built-in assumptions would get the level wrong;

-- each type of mark-up must have a unique level --- then the lack of mark-up at other levels can be deduced.
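
As a rough illustration of the second option (a sketch only, not anything xindy is known to implement): assume a table, here the hypothetical `LEVEL_OF`, that gives each mark-up type exactly one level. The type names `keyword`, `identifier` and `emph` are invented for the example. Any level with no explicit tag is then filled in as "no-mark-up":

```python
# Hypothetical table: each mark-up type lives at exactly one level
# (0 = outermost). The type names are invented for illustration.
LEVEL_OF = {"keyword": 0, "identifier": 0, "emph": 1}
NO_MARKUP = "none"  # "no-mark-up" treated as a mark-up type in its own right


def tower(explicit_tags):
    """Return one mark-up type per level (outermost first) for a
    character enclosed by the given explicit tags."""
    depth = max(LEVEL_OF.values()) + 1
    levels = [NO_MARKUP] * depth
    for tag in explicit_tags:
        level = LEVEL_OF[tag]
        if levels[level] != NO_MARKUP:
            raise ValueError(f"two mark-up types at level {level}")
        levels[level] = tag
    return tuple(levels)
```

With these assumptions, `tower(["emph"])` gives `("none", "emph")`: the missing outer level is deduced rather than the `emph` tag being mistaken for outer-level mark-up.
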

By some such method we can assign to each character a well-defined tower of mark-up (ie some mark-up at each possible level). Any reasonable ordering must be based on a total ordering of the set of all possible mark-up towers. This is what I mean by "reducing to one-level of mark-up".

2. How much flexibility do we need to give in assigning an order to this set of towers?

To take, as an example, the simple case of just one level of mark-up: clearly the user will then have to define a total order on the types of mark-up that can occur.

In practice (probably), it may only be necessary to point out to users that the more complex multi-level mark-ups can be flattened into a (possibly large) set of possible one-label mark-ups. If this set does not get too large then it may be better simply to ask the user to specify and to order this flattened set.

If this is not practical then it will be necessary to build an interface for specifying orders on the mark-up at each level and how to combine these to produce an order on the set of towers. There are two obvious possibilities for this combining: inner- or outer-level most significant? My intuition is that the outer level should be (since, in general, inner levels are likely to be no-mark-up).

3. All that only gives an ordering on the mark-up, not on the actual index entries.

It has been suggested that it may be necessary to allow the ordering to depend, for example, on the type of the longest "uniformly marked-up consecutive substring". I think that we should (at least for testing to see if it is sufficient) only allow orderings based on forward or reverse character-comparisons, ie the same possibilities as for the orderings by lexicographic properties.

4. There is also the question of how ordering by mark-up relates to ordering by lexicographic properties of the characters.

We have been, I think, assuming that mark-up becomes significant for ordering only when two entries are otherwise identical---but is it so simple?

An important special case is when one is, in effect, producing multiple indexes, eg completely separating use as identifiers from normal-language use of words. One could model this as the case where a certain type of mark-up is the first level of ordering, before anything based on the alphabet. Of course it can also be modeled by simply separating the entries for the different indexes completely before applying xindy separately to each index.

But suppose (not too unlikely, I think) that the specification asks for a separation of such types of index entry only "within first-letter" (eg all the identifiers beginning with A or a come first, then all other entries beginning with A or a; similarly for B or b, etc).

5. It is possible that some of these ideas may also be useful for multi-lingual indexes, another area where there has been little research, either empirical or more theoretical.
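
To make the three policies from point 4 concrete, here is a small sketch. It makes several assumptions: each entry carries a single mark-up type (the flattened, one-level case), the user has supplied a total order on types as the hypothetical `TYPE_RANK`, and plain case-insensitive comparison stands in for the real lexicographic rules. None of this is xindy's actual interface.

```python
# Assumed user-supplied total order on (flattened) mark-up types.
TYPE_RANK = {"identifier": 0, "none": 1}


def text_key(text):
    # Stand-in for real lexicographic rules: case-insensitive comparison.
    return text.lower()


def markup_last(entry):
    """Mark-up only breaks ties between otherwise identical entries."""
    text, kind = entry
    return (text_key(text), TYPE_RANK[kind])


def markup_first(entry):
    """Mark-up is the first level of ordering: in effect separate indexes."""
    text, kind = entry
    return (TYPE_RANK[kind], text_key(text))


def within_first_letter(entry):
    """Separate by mark-up type only within each first-letter group."""
    text, kind = entry
    key = text_key(text)
    return (key[:1], TYPE_RANK[kind], key)


entries = [("array", "none"), ("abs", "identifier"),
           ("begin", "identifier"), ("and", "identifier"),
           ("and", "none"), ("aardvark", "none")]

for policy in (markup_last, markup_first, within_first_letter):
    print(policy.__name__, sorted(entries, key=policy))
```

On this sample, only `within_first_letter` puts the identifiers `abs` and `and` ahead of the plain word `aardvark` while still keeping `begin` after all the A entries. That is the case the two simpler models cannot express.
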