XINDY: Problem

1. Problem

One of the most interesting new features of xinfy is its ability to offer mechanisms to sort arbitrary keyword using the sort mapping. This scheme was first proposed in [1] and implemented in the International MakeIndex [2]. This scheme has been integrated into the current version of xindy.

While playing around with the sort-rule mechanisms of xindy I was confronted with the problem that there often exist many ways to accomplish as particular sorting order but I was not aware of the underlying rules how to write a correct sorting specification.

Therefore, I wrote down a simple guideline which is presented.

1.1 Initial guideline

Note: This guideline was a first attempt to understande the meta-rules of writing sort rules and I'll give it here to make you able to get into the discussion!

How to write correct sort and merge rules?

One of the biggest hurdle in mastering xindy is the correct declaration of sort and merge rules. To bring a little bit of structure into this problem area we have prepared a set of rules-of-thumb that may help you to correclt specify rules.

First of all, if you have ever written a set of rules and the result is what you expect, then your rules are correct, since correctness is only a question of the result and not the way you did it.

But you may run into difficulties later if the rule set is not correct meaning that your rules may fail under certain circumstances.

Sort Rules

What you always must keep in mind is that the sort rules are applied to the result of the merge mapping. Thus, if you have already made a mistake in the merge mapping the sort rules often do not work as expected but actually may be correct!

Inserting Letters

The most common problem is to define an alphabet that is based on the latin alphabet but is extended by some letters. Inserting a letter often actually means to sort it before and after a letter, thus it is relative to an existing letter. For example the letter á must in some languages be sorted before a. In this case we should specify the following sort rule using the special letter ~b having a sorting order which is lower than all other characters. Actually it is implemented as character No. one (zero is reserves as the C string end delimiter character).

With the rule

á -> a~b

we would obtain the correct order

barábas -> bara~bbas

barabas -> barabas

But what happens with

bará -> bara~b

bara -> bara

This is obviously incorrect, since the second string is shorter than the first one. Thus inserting letters without considering the length of the result is not always a good idea.

The principle of tokenising

One answer to correctly spcifying rules lies in tokenising.

What we actually have is a group of letters that must be sorted correctly. The solution is to map all members of this group as follows:

á -> a0

a -> a1

à -> a2

ä -> a3

(Never use :again in these rules! This would lead replace a by a1 by a11 by a111...)

Thus we have produced a set of tokens a1,...,a4 that represent the members of the a-group.

Appending letters to an alphabet can be done exactly the same way:

For Finnish:

z -> z~b (or z0)

ä -> za (or z1)

ö -> zo (or z2)

Upper and lowercase letters

In German `Arm' and `arm' are two different words (meaning `arm' and `poor' resp.), but usually the lowercase variant preceeds the uppercase one. Both form a token group.

á -> a0 (or a00)

Á -> a1 (or a01)

a -> a2 (or a02)

A -> a3 (or a03)

à -> a4 (or a10)

À -> a5 (or a11)

ä -> a6 (or a12)

Ä -> a7 (or a13)

In the example of the Finnish alphabet one might write

Z -> zZ

z -> zz

Ä -> zA

ä -> za

Ö -> z0

ö -> zo

Thus, there are several possibilites to resolve this ambiguity but it is essential to do some form of tokenising to obtain predictable results.

The Problem with Hyphens

Leading hyphens should usually disappear:

star
   -gate
   field

This can be accomplished with the follwing rule:

(sort-rule "-" "")

but for the both (different) German words `Druck-Erzeugnis' and `Drucker-Zeugnis' one would expect the first one to appear before the second one. Above rule would in conjunction with case insensitive mapping for both words yield the same sort key, therefore the order of appearance would be not deterministic.

1.2 More Problems

After these initial ideas we were pointed to the French sorting rules (thanks Chris), which can be roughly described as follows (for a detailed description refer to [3])

The French sorting rules sort words in a first run as if they contained no accents. There may some words left that are equal if accents are not considered. For example, the words cote, côte, coté, and côté are equal in this sense. In the second sorting phase each set of words being equal in this sense is ordered considering the accents but sorting lexicographically from right to left (!). The resulting order is then the same order as listed above, which is not obviuos.

1.3 First Summary

Reviewing these examples it is clear that the intially intended sorting mechanism in xindy is hardly capable of handling these rather complex sorting schemes using only string and regular expression substitutions. In fact it is possible, but we would need a lot of regular expression substitutions appending something to the end of strings (this would simulate several runs). Furthermore, this approach would be rather slow and hardly manageable from the user's perspective.