The locale approach can be used to describe the sorting schemes
of many languages. However, using it in conjunction with xindy has
several disadvantages.
locale into xindy. But defining a new set of rules would incur
significant overhead: a new table must first be created and compiled
with localedef into a form readable by the library. This contrasts
with the dynamic paradigm of xindy and is therefore, in my opinion,
not feasible.
LC_COLLATE is rather complex and in no way declarative, which
initially was a design goal of xindy. Additionally, it solves the
problem of pure alphabets, but dealing with markup in the keyword is
still an open issue. The markup often cannot be pre-processed, since
it may only play a role in one of the later sorting runs. This
depends on the user's needs.
Based on these observations, my current proposal is a mixture of the
locale approach and the current implementation in xindy.
sort-rule can be extended with an additional argument :level, taking
the number of the level into which the sort rule is to be put.
Additionally, there must be a specification of how each run is to be
sorted (forward or backward). These rules may still contain regular
expression substitutions, which may come into play at any level as
necessary.
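To make the idea of per-level direction concrete, here is a minimal Python sketch. It is not xindy code: the names Level and sort_key are hypothetical, each level is reduced to a plain key function instead of a rule set, and "backward" is read as comparing a level's key from its end towards its start (the reading used by the backward directive in locale collation tables). That reading is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Level:
    """Hypothetical description of one sorting run."""
    key: Callable[[str], str]   # maps a keyword to its key at this level
    backward: bool = False      # assumed meaning: compare from the end


def sort_key(levels: Sequence[Level], keyword: str) -> Tuple[str, ...]:
    """Build one tuple with a component per level; tuples compare level by level."""
    parts = []
    for level in levels:
        k = level.key(keyword)
        # Reversing the key turns a right-to-left comparison into an
        # ordinary left-to-right string comparison.
        parts.append(k[::-1] if level.backward else k)
    return tuple(parts)


words = ["ab", "aB", "Ab"]

# Level 1 ignores case; level 2 compares the raw keyword.
forward = [Level(str.lower), Level(lambda w: w)]
backward = [Level(str.lower), Level(lambda w: w, backward=True)]

forward_order = sorted(words, key=lambda w: sort_key(forward, w))
backward_order = sorted(words, key=lambda w: sort_key(backward, w))

print(forward_order)
print(backward_order)
```

All three words are equal at the first level, so the second level alone decides, and the two directions give different orders.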
I'll give an example of how powerful this approach can be.
Assume we have the following keywords to sort:
\tt{ARM} \it{arm} Arm arm Armbrust armselig
Since we want to sort case-independently at the first level of comparison, this can be done with the following rule set:
A -> a
\tt{(.*)} -> \1 :again
\it{(.*)} -> \1 :again
This obtains the following result:
Arm, arm, \it{arm}, \tt{ARM}
Armbrust
armselig
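The first-level run can be sketched in Python. This is only an illustration under stated assumptions, not xindy's implementation: Rule and apply_rules are names made up for this sketch, the rule A -> a is read as shorthand for one such rule per uppercase letter, and rules are tried in order at each position of the keyword, with :again feeding the replacement back into the scan.

```python
import re
import string
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Rule:
    """Hypothetical stand-in for one sort rule."""
    pattern: str        # regular expression, matched at the current position
    replacement: str    # may refer to the first group as \1
    again: bool = False


def apply_rules(rules, keyword):
    """Scan left to right; the first rule matching at a position wins."""
    compiled = [(re.compile(r.pattern), r) for r in rules]
    out, rest = [], keyword
    while rest:
        for regex, rule in compiled:
            m = regex.match(rest)
            if m and m.end() > 0:
                produced = m.expand(rule.replacement)
                if rule.again:
                    # :again: re-scan the replacement with all rules
                    rest = produced + rest[m.end():]
                else:
                    out.append(produced)
                    rest = rest[m.end():]
                break
        else:
            # no rule matched: copy one character unchanged
            out.append(rest[0])
            rest = rest[1:]
    return "".join(out)


# Assumption: "A -> a" abbreviates one rule per uppercase letter.
LEVEL1 = [Rule(c, c.lower()) for c in string.ascii_uppercase] + [
    Rule(r"\\tt\{(.*)\}", r"\1", again=True),
    Rule(r"\\it\{(.*)\}", r"\1", again=True),
]

KEYWORDS = [r"\tt{ARM}", r"\it{arm}", "Arm", "arm", "Armbrust", "armselig"]


def level1(keyword):
    return apply_rules(LEVEL1, keyword)


# Keywords with the same first-level key fall into one class.
classes = [list(g) for _, g in groupby(sorted(KEYWORDS, key=level1), key=level1)]
for group in classes:
    print(", ".join(group))
```

The order inside a class is not yet determined at this level; the sketch simply keeps the input order there.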
The intended sorting rule says that the keywords containing markup should come before the others. Thus we must define a rule set expressing this sort order:
\tt{(.*)} -> 0\1 :again
\it{(.*)} -> 1\1 :again
A -> 2a
a -> 2a
Now we have prefixed the letters to obtain a further relative sorting order:
\tt{ARM}
\it{arm}
Arm, arm
The last step is to obtain a total order. We do not specify any other rules, since we sort according to the position in the ISO Latin alphabet, with A coming before a, obtaining
Arm
arm
Thus, we have gradually refined the partial order into a total one.
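The whole refinement can be sketched in Python as one key with a component per level. Again, this is a sketch under assumptions rather than xindy's implementation: Rule, apply_rules and sort_key are hypothetical names, A -> a at the first level is read as one rule per uppercase letter, the second-level rules are taken literally as written, and the third level compares the unmodified keywords by character code.

```python
import re
import string
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical stand-in for one sort rule."""
    pattern: str
    replacement: str
    again: bool = False


def apply_rules(rules, keyword):
    """Try the rules in order at each position; :again re-scans the replacement."""
    compiled = [(re.compile(r.pattern), r) for r in rules]
    out, rest = [], keyword
    while rest:
        for regex, rule in compiled:
            m = regex.match(rest)
            if m and m.end() > 0:
                produced = m.expand(rule.replacement)
                if rule.again:
                    rest = produced + rest[m.end():]
                else:
                    out.append(produced)
                    rest = rest[m.end():]
                break
        else:
            out.append(rest[0])
            rest = rest[1:]
    return "".join(out)


# Level 1: ignore case and markup (assumed: one rule per uppercase letter).
LEVEL1 = [Rule(c, c.lower()) for c in string.ascii_uppercase] + [
    Rule(r"\\tt\{(.*)\}", r"\1", again=True),
    Rule(r"\\it\{(.*)\}", r"\1", again=True),
]

# Level 2: markup first, encoded as a digit prefix.
LEVEL2 = [
    Rule(r"\\tt\{(.*)\}", r"0\1", again=True),
    Rule(r"\\it\{(.*)\}", r"1\1", again=True),
    Rule("A", "2a"),
    Rule("a", "2a"),
]


def sort_key(keyword):
    # Level 3 has no rules: the raw keyword, compared by character code.
    return (apply_rules(LEVEL1, keyword), apply_rules(LEVEL2, keyword), keyword)


unsorted = ["armselig", "arm", r"\it{arm}", "Armbrust", "Arm", r"\tt{ARM}"]
total_order = sorted(unsorted, key=sort_key)
for keyword in total_order:
    print(keyword)
```

Because tuples compare component by component, a later level is consulted only when all earlier levels consider two keywords equal, which is exactly the stepwise refinement described above.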
The advantage is that we are still able to use declarative
descriptions such as \tt{(.*)} -> \1 to match many keywords at once.
I have several questions about this scheme and I'm interested in your opinion.
2arm stuff, for example, which is actually only a work-around
to re-incorporate tokens in a very unclean manner). But is it
necessary to include tokens?