A new semester began last Tuesday with my XML and Related Languages course. Because of Martin Luther King, Jr. Day, the first Monday of the semester is after the first Tuesday, and tomorrow Software Development Methodologies will meet for the first time.
(Last semester went very well. I’m a much better student than I was ten years ago. I guess that’s what happens when I pick something that I’m good at rather than stubbornly picking mathematics classes. Oh what a quixotic four years those were.)
The internationalization discussion continues. For those who don’t know already, not everyone in the world who uses a computer knows (or wants to know) English. Internationalization refers to the practice of writing software that can be easily adapted to another locale. This involves a lot of different things: character sets; text direction; currency symbols; formats for dates, times, and numbers; spelling rules; etc. And sometimes even plain-old English speakers need to use accented characters.
My XML course is the second technical class in a row where character encoding has come up. You might remember that Perl has ample support for multiple languages. A developer can create variable names and values in almost any language. Similarly, XML allows almost any Unicode character in element names, attribute names, and values. I’m sensing a trend in these newer, widely adopted languages that form the basis of our information economy.
But once again, in my class there was a fair bit of confusion about terminology and concepts. I’m no expert, but I think I know what Spolsky calls “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.” So here goes:
Every symbol you can use to create a word in a particular language is called a character. Some languages have few characters. Others, like Chinese, have many, many more. Hindi, Hebrew, and Arabic have a set of base characters along with special marks for vowel sounds (e.g., क = ka, कि = ki, को = ko, कु = ku, कै = kai, etc.), and Hindi also joins consonants into combined forms (न + द = न्द [nd] and प + त = प्त [pt]). Multiple languages often use many of the same characters; consider French, German, and Spanish. Don’t confuse characters with the typefaces or fonts that present characters. Switching from Arial to Courier doesn’t change the underlying meaning of a character, just its form.
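To make the Hindi examples concrete, here is a small Python sketch (Python is my choice for illustration; the helper name `code_points` is made up). It shows that what looks like one symbol on screen can be stored as a base character plus a mark:

```python
import unicodedata

def code_points(text):
    """Return (code point, Unicode name) pairs for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Written with escapes so the individual code points are visible in the source.
ki = "\u0915\u093f"          # displays as "ki": the letter ka plus a vowel sign
nd = "\u0928\u094d\u0926"    # displays as "nd": na, a joining sign, then da

for label, text in (("ki", ki), ("nd", nd)):
    print(label, "is", len(text), "code points:")
    for point, name in code_points(text):
        print("   ", point, name)
```

The font decides how those pieces are drawn together; the stored values stay the same whichever typeface you pick.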
A character set is the group of all characters from all languages that a given system can describe. Most character sets have relatively few characters (ASCII, for example, has 128), while Unicode’s Universal Character Set has more than 100,000. There are lots of character sets because, back in the olden days before widespread information globalization, locales rarely interacted (or rarely interacted easily), and everyone did what was easy. (Sometimes, you really are going to need it.)
Intimately related to a character set is character encoding, which assigns a value to each character. Computers can’t read and don’t really understand characters, but they do very well with numbers and abstraction. Encoding maps characters to numbers and vice versa. Storing characters in memory or on disk is simply a matter of using the right value for the character.
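Here is a minimal Python sketch of that character-to-number mapping. The function names `to_numbers` and `from_numbers` are mine, for illustration only; `ord` and `chr` are Python built-ins that convert between a character and its Unicode number.

```python
def to_numbers(text):
    """Map each character to its Unicode number (code point)."""
    return [ord(ch) for ch in text]

def from_numbers(numbers):
    """Map a list of Unicode numbers back to text."""
    return "".join(chr(n) for n in numbers)

word = "caf\u00e9"             # "cafe" with an acute accent on the e
numbers = to_numbers(word)
print(numbers)                 # [99, 97, 102, 233]
print(from_numbers(numbers) == word)   # True
```

Note that these are just the numbers assigned to the characters. How those numbers end up as bytes in memory or on disk is a separate question.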
This brings up a huge, earth-shaking gotcha. A computer byte, the basic unit of computer memory, can only hold 256 values, but Unicode allows for more than a million characters. For decades, programmers using Roman, Greek, and Cyrillic character sets saw no problem with this and stored each character in one byte. Encodings overlapped, but incompletely, and no one gave much regard to the future. In this modern multilingual computing world we need anywhere between one and four bytes to store a character from the Universal Character Set, depending on the character and the encoding.
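A quick Python sketch of the one-byte limit. The sample characters and the helper `fits_in_one_byte` are my own picks for illustration. Latin-1 stands in for the old single-byte encodings, and UTF-8 is one common Unicode encoding.

```python
samples = {
    "Latin capital A": "A",
    "e with acute accent": "\u00e9",
    "Devanagari ka": "\u0915",
    "grinning face emoji": "\U0001F600",
}

def fits_in_one_byte(ch, encoding="latin-1"):
    """True if the single-byte encoding has a value for this character."""
    try:
        return len(ch.encode(encoding)) == 1
    except UnicodeEncodeError:
        return False

for label, ch in samples.items():
    utf8_size = len(ch.encode("utf-8"))
    print(f"{label}: one byte in Latin-1? {fits_in_one_byte(ch)}; UTF-8 bytes: {utf8_size}")
```

The first two characters fit in a Latin-1 byte and the last two do not, which is the whole problem in miniature.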
The Unicode standard has two parts: one describes the characters unambiguously, and the other says how to map their numeric values to bytes. There’s more than one way to do this mapping; UTF-8, UTF-16, and UTF-32 are among them. I don’t know all of the gory details of these encoding schemes, and I don’t think you would want to know either. But here’s what you should know: A character is no longer the same as a byte. It might be two bytes; it might be three or even four. In UTF-8, the leading bits of the first byte indicate how many more bytes are coming for this one character.
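For the curious, the UTF-8 first-byte trick can be sketched in a few lines of Python. The function name `utf8_sequence_length` is hypothetical, and this is a simplified sketch: it only looks at the bit pattern of the first byte and does not validate the rest of the sequence the way a real decoder must.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 character occupies, judging by its first byte."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: plain ASCII, one byte
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: one more byte follows
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: two more bytes follow
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: three more bytes follow
        return 4
    # 10xxxxxx is a continuation byte, so it cannot start a character.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 character")

for ch in ("A", "\u00e9", "\u0915", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(encoded.hex(" "), "->", utf8_sequence_length(encoded[0]), "byte(s)")
```

For each sample, the length predicted from the first byte matches the number of bytes Python actually produced.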
If you’re a programmer and live in the old “1 character = 1 byte” world, you might get lucky because ASCII is a subset of UTF-8. All ASCII documents are also UTF-8. The bad news, and it’s pretty bad, is that UTF-8 documents aren’t always ASCII, and in the future even fewer will be. If you treat all files as ASCII, you’re going to get it wrong . . . a lot.
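Both halves of that claim are easy to check in Python. This is only a sketch with sample strings I made up:

```python
# Good news: bytes that are valid ASCII decode identically as UTF-8.
ascii_bytes = "plain text".encode("ascii")
same_text = ascii_bytes.decode("utf-8")
print(same_text)                        # plain text

# Bad news: UTF-8 bytes are not always valid ASCII.
utf8_bytes = "caf\u00e9".encode("utf-8")    # "cafe" with an accented e
try:
    utf8_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)                         # False

# Worse: a one-byte-per-character decoder "succeeds" but garbles the text.
garbled = utf8_bytes.decode("latin-1")
print(len(garbled))                     # 5 characters instead of 4
```

The last case is the nasty one, because nothing fails; the accented letter silently turns into two wrong characters.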
Unfortunately, the way forward involves a chicken and egg problem. In a text document, how do you know the character encoding? Is the document ASCII or UTF-8?
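About the best a program can do with a bare text file is guess. Here is a rough Python sketch of one such heuristic; the function name `guess_encoding` and its return labels are my own invention. It can rule encodings out, but it can never prove what the author intended.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Heuristic guess: UTF-8 with a byte order mark, ASCII, UTF-8, or unknown."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"            # file starts with the UTF-8 byte order mark
    for name in ("ascii", "utf-8"):   # try the strictest candidate first
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return "unknown"                  # probably some legacy single-byte encoding

print(guess_encoding(b"hello"))                          # ascii
print(guess_encoding("caf\u00e9".encode("utf-8")))       # utf-8
print(guess_encoding("caf\u00e9".encode("latin-1")))     # unknown
print(guess_encoding(codecs.BOM_UTF8 + b"hello"))        # utf-8-sig
```

A file that decodes cleanly as UTF-8 might still have been written in something else, which is exactly the chicken and egg problem.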
This, my friends, is why I work with non-text formats that have specifications that give the answers to these questions (or allow you to assume ASCII).



