;login: The Magazine of USENIX & SAGE
Standards

 

FAQ: ISO/IEC 10646-1 Versus Unicode

by Alain LaBonté
<alb@sct.gouv.qc.ca>

Alain LaBonté is the head of the Canadian delegation to ISO/IEC JTC1/SC22/WG20, the internationalization (i18n) working group.

Q. Is the ISO/IEC 10646-1 Standard the same thing as Unicode?

A. It is pretty close, yes. The Unicode standard is more restrictive.

Q. Then how is the ISO/IEC 10646-1 Standard different from Unicode?

A. The ISO/IEC 10646-1 Standard is an International Standard (with a capital S; only the ISO can claim that designation) that covers:

  • 16-bit or 32-bit code;
  • "transformed formats" for compatibility with existing transmission standards;
  • three levels of compliance for the internal representation of characters:
    • 1. no composition of characters (all characters fully formed rather than built from a base character followed by diacritics); this excludes many languages, but it simplifies the life of programming languages for all Western languages without exception (and for languages that do not require composition, such as all Far Eastern languages);
    • 2. the composition of some, but not all, characters (obscure level, not sufficiently thought out; will be little used, in my opinion);
    • 3. the mix of the technique of composition with the possibility of coding fully formed characters;
  • total openness in the use of characters (no canonical form, no equivalency between composed characters and fully formed characters);
  • possible support for dead languages, in addition to all the living languages;
  • developmental possibilities for all practical purposes unlimited (eventually up to two billion separate characters).
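The difference between a fully formed character and a composed sequence is easy to see in code. A minimal sketch in Python (the example and variable names are mine, not the article's), showing the same accented letter both ways:

```python
# "é" as a single precomposed (fully formed) character: U+00E9
precomposed = "\u00e9"

# "é" as a base letter "e" followed by a combining acute accent: U+0065 U+0301
composed = "e\u0301"

# Both sequences render identically, yet they are different code-point strings.
print([hex(ord(c)) for c in precomposed])  # one code point
print([hex(ord(c)) for c in composed])     # two code points

# At level 1 (no composition) only the first form exists; at level 3 both may
# occur, and ISO/IEC 10646-1 itself imposes no equivalence between them.
print(precomposed == composed)             # False
```

This is precisely the "total openness" mentioned above: the International Standard leaves it to higher layers to decide whether the two spellings should be treated as the same character.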

Unicode provides:

  • exclusively 16-bit code;
  • a transformed format allowing access to at most about a million of the characters in the 32-bit ISO/IEC 10646-1 coding space (this is considered amply sufficient for the foreseeable future, even long term, for business purposes);
  • a canonical form allowing for "normative" equivalency of characters that are precomposed or formed in a predetermined order from a base character and diacriticals;
  • rigid methods of presentation (no exceptions);
  • in parallel with this, various other closed methods of processing and presentation (the advantage being that implementation is rigorously predictable); what is noteworthy is that this standard is directly related to an ordering method that constitutes a "delta" framed within the ISO/IEC 14651 International Standard.
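Unicode's canonical equivalence can be exercised directly through normalization. A short sketch using Python's standard unicodedata module, which implements the Unicode normalization forms (the example itself is mine, not the article's):

```python
import unicodedata

precomposed = "\u00e9"   # é as one fully formed code point
composed = "e\u0301"     # é as base letter + combining acute accent

# As raw code-point sequences the two spellings differ...
assert precomposed != composed

# ...but under Unicode's canonical forms they are declared equivalent:
# NFC composes to the precomposed form, NFD decomposes to base + diacritic.
assert unicodedata.normalize("NFC", composed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == composed
```

This normative equivalence is exactly what ISO/IEC 10646-1, with its "total openness," declines to impose.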

These are the essential differences, but the coding is essentially the same. Whatever complies with Unicode complies with the International Standard, but the opposite is not necessarily true.

Q. Is Unicode a standard?

A. In the strict sense of the word, it is a de facto standard, and therefore a "private" norm. During a recent debate, the conclusion was that standardization was a process that also included the development of de facto standards. A "de facto standard" is therefore in some ways an object of standardization, but a "de jure standard" is much more than that; it is more general in scope. In short, the difference is relatively fuzzy.

But usage tends to distinguish "de facto standards" from proper "de jure standards." (French distinguishes those two terms in single words: "de facto standards" are known as standards and "de jure standards" are known as normes in that language.)

With respect to "International Standards," the sole authority that has the right to use this term is the ISO, the International Organization for Standardization, and this by agreement with the major international organizations (such as the WTO, the UN, the ITU, etc.). This matters because when you talk about an International Standard at the WTO within the framework of world trade, you are not referring to just any standard. In this sense, ISO/IEC 10646-1 legitimates the Unicode standard. This is the case for many other important standards.

Q. Sometimes I hear people talking in French about the "Unicode standard." What is a "standard"? In my view, "standard" is the English word that corresponds to the French norme, but I have my doubts. People often talk about "standard" with respect to .doc files, but there is nothing more illogical or incomprehensible than that type of thing! Sometimes I have the impression that standard in French means something like: "This year your system can read it, but next year it will not be able to, so you will have to buy the new 'standard' version of the software!"

A. I could not have said it better. A de facto standard is like strands of spaghetti thrown against a wall. For a while they will stick, and you can count on them being there. If they do not stick, you forget about them. So a standard depends on what is going on at the time, on marketing, and not on a planned desire for consensus. At some point the strands form a picture on the wall, a Riopelle (an abstract canvas in the manner of the Québec painter Jean-Paul Riopelle), consolidating everything into a consensus and fixing the picture in place.

An International Standard is basically a planned Riopelle: the strands of spaghetti are not suddenly thrown against the wall; they are placed there carefully, one by one, and the decision to make it an International Standard is taken when 75% of the countries participating in a project are in agreement. Before publication, people will even hasten, to the fullest extent possible, to satisfy those who are particularly capricious and have not yet come on board. The making of an International Standard is more impartial, but it is also a much more difficult path.

Q. What do UTF-7 and UTF-8 have to do with all of this? How do they differ? Or are these two "standards" in the sense described in the previous paragraph, that is, things that you need to avoid like the plague?

A. UTF-8 is a "transformed format," standardized in ISO/IEC 10646-1, that breaks each complete 32-bit code down into a sequence of 8-bit bytes, each able to pass through the most capricious 8-bit transmission devices. (On such devices some 8-bit combinations may not get through, specifically combinations representing control characters.) UTF-8 eliminates this limitation by restricting the combinations of bytes it uses.
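The byte-level behavior just described can be observed directly. A small Python sketch (sample characters chosen by me for illustration), showing how UTF-8 spreads one code point over one to four bytes while steering clear of the control-character range:

```python
# UTF-8 uses 1 to 4 bytes per code point, depending on its magnitude.
for ch in ["A", "é", "愛", "\U0001D11E"]:   # ASCII, Latin-1, Han, musical symbol
    data = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {data.hex()} ({len(data)} bytes)")

# Every byte of a multi-byte sequence has its high bit set (0x80 and above),
# so none of them can be mistaken for an ASCII control character.
assert all(b >= 0x80 for b in "é".encode("utf-8"))
```

ASCII text passes through unchanged, which is a large part of why the format travels so well.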

UTF-16 is a "transformed format," also standardized in ISO/IEC 10646-1, that gives access to only the first seventeen 32-bit code planes (about 1.1 million characters, instead of 2 billion). It was created essentially to break the Unicode standard out of its impasse: initially Unicode was limited to some 65,000 characters, but it quickly became apparent that this was not enough for all the Han characters, for example. UTF-16 succeeded there, but it cannot get around the barrier of the finicky transmission devices described above.
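The mechanism behind this extension is the surrogate pair: a character beyond the first plane is carried as two 16-bit units drawn from ranges reserved for the purpose. A sketch in Python (example character chosen by me):

```python
# A character in the first plane (the BMP) fits in one 16-bit unit.
assert len("é".encode("utf-16-be")) == 2

# A character beyond U+FFFF needs a surrogate pair: a high surrogate
# (D800-DBFF) followed by a low surrogate (DC00-DFFF).
clef = "\U0001D11E"              # MUSICAL SYMBOL G CLEF
units = clef.encode("utf-16-be")
assert len(units) == 4           # two 16-bit units

high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")
assert 0xD800 <= high <= 0xDBFF
assert 0xDC00 <= low <= 0xDFFF
```

Reserving those two ranges in the 16-bit space is what caps the reachable repertoire at roughly a million characters.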

UTF-7 is a nonstandardized "transformed format" (though allowed on the Internet) that can get past the archaic barrier of devices, now on their way out, that allow the safe passage of only 7 bits per byte (wasting one bit in eight, or 12.5% of an 8-bit byte's capacity). I see a very limited future here.
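Python still ships a UTF-7 codec, which makes the 7-bit property easy to verify (the sample string is mine):

```python
# UTF-7 keeps every byte within the 7-bit ASCII range by switching into a
# modified base64 mode (introduced by "+") for non-ASCII characters.
text = "héllo"
encoded = text.encode("utf-7")
print(encoded)

assert all(b < 0x80 for b in encoded)    # safe for 7-bit-only transports
assert encoded.decode("utf-7") == text   # lossless round trip
```

The round trip is lossless, but the base64 escapes make the format both bulkier and harder to read than UTF-8.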

UTF-32 is a false "transformed format," nonstandardized, defined by the Unicode consortium, which limits the 32-bit coding to the list allowed by UTF-16 and therefore by the Unicode standard. This is a guarantee that the 32-bit coding used is compatible with Unicode.

UTF-8 is the current trend, the best compromise, for all present and future communication protocols for an "internationalized" and "localizable" Internet. This format allows the unrestricted coding of the entire present and future inventory of the universal character set (ISO/IEC 10646-1) and, therefore, of Unicode. Processing it is, however, less simple than is processing the native 32-bit code. This is an intermediate format that will be especially useful in transmission.
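The trade-off between UTF-8 and the native fixed-width coding can be put in numbers. A comparison sketch in Python (the sample text is mine, mixing Latin, Han, and a supplementary-plane character):

```python
# Byte cost of the same text in three transformed formats.
text = "Ünïcode 愛 \U0001D11E"
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    print(f"{codec}: {len(text.encode(codec))} bytes")

# UTF-32 is fixed-width: exactly 4 bytes per code point, trivial to index
# but bulky on the wire.
assert len(text.encode("utf-32-be")) == 4 * len(text)

# UTF-8 is variable-width: more work to process, but markedly more compact
# for mostly Latin text, hence its role as the transmission format.
assert len(text.encode("utf-8")) < len(text.encode("utf-32-be"))
```

This is the sense in which UTF-8 is "an intermediate format that will be especially useful in transmission": processing favors the fixed-width form, while the wire favors UTF-8.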

 

Last changed: 21 Nov. 2000 ah