One of the most interesting new features of
While playing around with the sort-rule
mechanisms of xindy
I was confronted with the problem that there often exist many ways to
accomplish as particular sorting order but I was not aware of the
underlying rules how to write a correct sorting specification.
Therefore, I wrote down a simple guideline which is presented.
Note: This guideline was a first attempt to understande the meta-rules of writing sort rules and I'll give it here to make you able to get into the discussion!
How to write correct sort and merge rules?
One of the biggest hurdle in mastering xindy is the correct declaration of sort and merge rules. To bring a little bit of structure into this problem area we have prepared a set of rules-of-thumb that may help you to correclt specify rules.
First of all, if you have ever written a set of rules and the result is what you expect, then your rules are correct, since correctness is only a question of the result and not the way you did it.
But you may run into difficulties later if the rule set is not correct meaning that your rules may fail under certain circumstances.
What you always must keep in mind is that the sort rules are applied to the result of the merge mapping. Thus, if you have already made a mistake in the merge mapping the sort rules often do not work as expected but actually may be correct!
The most common problem is to define an alphabet that is based on the
latin alphabet but is extended by some letters. Inserting a letter
often actually means to sort it before and after a letter, thus it is
relative to an existing letter. For example the letter á must
in some languages be sorted before a. In this case we should specify
the following sort rule using the special letter
~b
having a
sorting order which is lower than all other characters. Actually it is
implemented as character No. one (zero is reserves as the C string end
delimiter character).
With the rule
á
we would obtain the correct order
barábas
barabas
But what happens with
bará
bara
This is obviously incorrect, since the second string is shorter than the first one. Thus inserting letters without considering the length of the result is not always a good idea.
One answer to correctly spcifying rules lies in tokenising.
What we actually have is a group of letters that must be sorted correctly. The solution is to map all members of this group as follows:
á
a
à
ä
(Never use :again in these rules! This would lead replace a by a1 by a11 by a111...)
Thus we have produced a set of tokens a1,...,a4 that represent the members of the a-group.
Appending letters to an alphabet can be done exactly the same way:
For Finnish:
z
ä
ö
In German `Arm' and `arm' are two different words (meaning `arm' and `poor' resp.), but usually the lowercase variant preceeds the uppercase one. Both form a token group.
á
Á
a
A
à
À
ä
Ä
In the example of the Finnish alphabet one might write
Z
z
Ä
ä
Ö
ö
Thus, there are several possibilites to resolve this ambiguity but it is essential to do some form of tokenising to obtain predictable results.
Leading hyphens should usually disappear:
star -gate field
This can be accomplished with the follwing rule:
(sort-rule "-" "")
but for the both (different) German words `Druck-Erzeugnis' and `Drucker-Zeugnis' one would expect the first one to appear before the second one. Above rule would in conjunction with case insensitive mapping for both words yield the same sort key, therefore the order of appearance would be not deterministic.
After these initial ideas we were pointed to the French sorting rules (thanks Chris), which can be roughly described as follows (for a detailed description refer to [3])
The French sorting rules sort words in a first run as if they contained no accents. There may some words left that are equal if accents are not considered. For example, the words cote, côte, coté, and côté are equal in this sense. In the second sorting phase each set of words being equal in this sense is ordered considering the accents but sorting lexicographically from right to left (!). The resulting order is then the same order as listed above, which is not obviuos.
Reviewing these examples it is clear that the intially intended sorting mechanism in xindy is hardly capable of handling these rather complex sorting schemes using only string and regular expression substitutions. In fact it is possible, but we would need a lot of regular expression substitutions appending something to the end of strings (this would simulate several runs). Furthermore, this approach would be rather slow and hardly manageable from the user's perspective.