Sorting words

Previously, I’ve posted about using non-English characters in programming and the basics of character set encoding. It appears that I need to keep adding to this theme of internationalization and Unicode.

In particular, the issue of collation (how to sort characters) came up in my UNIX Tools class. I knew that the sort order of words and strings depends on the localization, or “locale,” settings your system uses, as do the decimal separator, currency symbol, and a raft of other settings. For example, here’s the sort order of some files on my Mac OS X system. (The export command sets an environment variable.)

jeff-mathers-computer:~/tmp jeff$ export LANG=de_DE

jeff-mathers-computer:~/tmp jeff$ echo *.txt
A.txt a1.txt passé.txt passe.txt strabismus.txt strasberg.txt
strasse.txt strategery.txt straße.txt تاببا.txt تبب.txt देवनगारि.txt

jeff-mathers-computer:~/tmp jeff$ export LANG=de_DE.ISO8859-1

jeff-mathers-computer:~/tmp jeff$ echo *.txt
A.txt تاببا.txt تبب.txt देवनगारि.txt a1.txt passe.txt passé.txt
straße.txt strabismus.txt strasberg.txt strasse.txt strategery.txt

jeff-mathers-computer:~/tmp jeff$ export LANG=de_DE.UTF-8

jeff-mathers-computer:~/tmp jeff$ echo *.txt
A.txt a1.txt passé.txt passe.txt strabismus.txt strasberg.txt
strasse.txt strategery.txt straße.txt تاببا.txt تبب.txt देवनगारि.txt

The Mac Finder lists the files in yet another order: A.txt, a1.txt, passe.txt, passé.txt, strabismus.txt, strasberg.txt, strasse.txt, straße.txt, strategery.txt, تاببا.txt, تبب.txt, देवनगारि.txt. Your system may be different.

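The appearance of the term “UTF-8” surprised me, because I only knew about it in the context of character encoding. Strictly speaking, UTF-8 is just that: one of the Unicode character encodings. But it turns out that the Unicode Standard and its companion specifications cover much more than encodings. They also define the Unicode Collation Algorithm, rules for combining several characters into one visual character (for example, a base letter plus a combining accent), and even more. A locale name such as de_DE.UTF-8 pairs a language and region with an encoding, and the system picks its collation rules based on that combination.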
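To keep encoding and collation apart, here is a minimal Python sketch. It uses a few of the file names from the listings above and only the standard library. The variable names are mine. It shows two things: sorting UTF-8 bytes gives plain code point order, which is not language-aware, and the same visible word can be stored either with a precomposed é or as e plus a combining accent.

```python
import unicodedata

# A few of the file names from the listings above.
names = ["straße.txt", "strasse.txt", "passé.txt", "passe.txt", "a1.txt", "A.txt"]

# Python compares strings code point by code point. Sorting the UTF-8
# encoded bytes gives the same order, because UTF-8 preserves code point
# order. Neither one knows anything about German (or any other language).
by_code_point = sorted(names)
by_utf8_bytes = sorted(names, key=lambda s: s.encode("utf-8"))

# One visual character, two representations:
# U+00E9 (precomposed é) versus "e" followed by U+0301 (combining acute).
composed = unicodedata.normalize("NFC", "passe\u0301")
decomposed = unicodedata.normalize("NFD", "pass\u00e9")

if __name__ == "__main__":
    print(by_code_point)
    print(by_code_point == by_utf8_bytes)
    print(len(composed), len(decomposed))
```

In code point order, A.txt sorts before a1.txt, passe.txt before passé.txt, and strasse.txt before straße.txt, simply because the accented letters and ß have higher code points than the plain ASCII letters.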

A key thing to remember seems to be that there’s no single collation order for a given character set and that these collation orders depend partly on the language of the locale. That makes sense; how you sort accented letters varies from language to language.
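As a rough illustration of that point, the sketch below sorts the same names under whatever locale you ask for, using Python's standard locale module. The helper name sort_for_locale is my own, and the sketch assumes the locale you request (for example de_DE.UTF-8) is actually installed; if it is not, the helper returns None rather than guessing. The resulting order depends on your system's locale data, so I am not showing expected output.

```python
import locale

def sort_for_locale(names, locale_name):
    """Sort names using locale_name's collation rules.

    Returns None if the locale is not available on this system.
    (Hypothetical helper for illustration only.)
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that compares
        # according to the active collation rules.
        return sorted(names, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting

if __name__ == "__main__":
    names = ["straße.txt", "strasse.txt", "passé.txt", "passe.txt", "a1.txt", "A.txt"]
    for name in ("de_DE.UTF-8", "sv_SE.UTF-8"):
        print(name, sort_for_locale(names, name))
```

Note that setlocale changes process-wide state, which is why the helper restores the previous setting before returning.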

