
Message IDs

Not every string in an application needs to be translated. For example, some strings are used as keys in dictionaries, or represent mail headers, or contain HTML tags. To make the proper distinction we refer to strings that are intended for human readability as ``text'' or ``messages''. One of the most labor-intensive parts of internationalizing an existing code base such as Mailman's is to go through every string in the software and distinguish messages from ordinary strings. In addition to the non-translatable strings described above, the decision was made not to translate log messages, since these are not intended for the end-user and translating them would make debugging in a global community more difficult.

Each message that is to be translated needs to have four pieces of information at runtime in order to calculate the translated text: the application domain, the message id, the default text, and the target locale. Because Mailman is a fairly self-contained application, there is only one static domain, the ``mailman'' domain, which never changes during the life of the program's execution.

The message id and default text are two related but distinct concepts. The message id uniquely identifies the textual message to be displayed to the user. It names the message, but it is not necessarily the text of the message itself. The message id is the primary key into a translation catalog dictionary.

The default text is the text to use as the translation of the message id when the id is not found in the translation catalog. Because coordinating 20 different language teams is a project management challenge, it is common for some language catalogs to lag behind the source code development. Mailman releases are rarely delayed so that language teams can catch up (although advance notice of impending releases is usually given). It is often the case, therefore, that a particular message id won't be found in a specific language catalog. The default text is the fallback to use in this case.

As an example, suppose a web form had a Delete button. The message id for the button might be something like ``form27-delete-button'', while the default text might be ``Delete''.
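The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mailman's implementation: the `CATALOGS` dictionary, the `translate` function, and the Spanish entry are all hypothetical names and data introduced here. It shows the four pieces of information (domain, message id, default text, locale) being combined, with the default text used when a catalog lags behind.

```python
# Hypothetical catalogs keyed by (domain, locale). These are illustrative
# data structures, not the ones Mailman actually uses.
CATALOGS = {
    ("mailman", "es"): {"form27-delete-button": "Eliminar"},
    ("mailman", "fr"): {},  # a catalog that lags behind the source code
}


def translate(domain, msgid, default, locale):
    """Return the translation of msgid, or the default text if none exists."""
    catalog = CATALOGS.get((domain, locale), {})
    return catalog.get(msgid, default)


# The id is found in the Spanish catalog, so the translation is used.
spanish = translate("mailman", "form27-delete-button", "Delete", "es")

# The id is missing from the French catalog, so the default text is used.
french = translate("mailman", "form27-delete-button", "Delete", "fr")

print(spanish)
print(french)
```

Note that the message id never reaches the user in either case: the result is always either a catalog entry or the default text.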

Message ids may be explicit or implicit. In the above example ``form27-delete-button'' is an explicit message id. While it uniquely identifies the message to be used, it does not contain any text that will be displayed to the user. The advantage of explicit message ids is that they are immune to minor typos or formatting changes (e.g. whitespace or punctuation additions or deletions). The disadvantages of explicit message ids are two-fold: they require an extra catalog mapping message ids to the default language (e.g. English in Mailman's case), and they make the source code less readable. The latter is the more serious consequence; since nearly all human readable text in Mailman exists in Python source code, using explicit message ids would make the code nearly unreadable. A developer would have to consult the English catalog several times for some lines of code.

The alternative approach is to use implicit message ids, where the message id serves a dual purpose as the default text. Thus the human readable text that appears in the Python source code is first used as the message id, and if that fails to find a translation, it is used as the default text. While this has the advantage of making the source code more readable and easier to develop, it has several disadvantages. First, a message such as ``Delete'', which has one spelling in English, may be translated to one of several different words in another language, depending on the context. This poses a problem for the translator because the message id ``Delete'' may appear a dozen times in the application, but may require several different words in the target language. Second, minor changes in formatting or punctuation change the message id, which requires a re-translation (this may be considered an advantage because changes in punctuation can cause semantic differences, requiring a re-translation anyway).

There is no perfect solution, but Mailman has decided to use implicit message ids because of the source code readability advantages. This occasionally requires negotiation between the application developers and the translation teams to choose appropriate and distinguishable message ids, and imposes a sort of inertia against changing existing text in the source code. One way to alleviate these problems in future releases would be to use a mix of implicit and explicit message ids, where implicit ids are used predominantly, but in rare cases explicit ids (along with a partial English catalog) are used to resolve ambiguities.
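One possible shape for such a mixed scheme is sketched below. It is speculative: the partial English catalog, the explicit ids, and the choice of Spanish words are hypothetical and serve only to show how an ambiguous ``Delete'' could be split into two explicit ids while all other messages remain implicit.

```python
# Hypothetical partial English catalog: only the ambiguous messages get
# explicit ids, so this catalog stays small.
ENGLISH_PARTIAL = {
    "member-delete-button": "Delete",
    "archive-delete-button": "Delete",
}

# Hypothetical Spanish catalog mixing implicit and explicit ids. The two
# translations of "Delete" are illustrative only.
CATALOG_ES = {
    "Subscribe": "Suscribirse",          # implicit id
    "member-delete-button": "Eliminar",  # explicit id
    "archive-delete-button": "Borrar",   # explicit id
}


def translate(msgid, catalog):
    """Look up msgid; fall back to the English catalog, then to msgid itself."""
    if msgid in catalog:
        return catalog[msgid]
    # Explicit ids fall back to their English default text; implicit ids
    # are their own default text.
    return ENGLISH_PARTIAL.get(msgid, msgid)


print(translate("Subscribe", CATALOG_ES))
print(translate("member-delete-button", CATALOG_ES))
print(translate("archive-delete-button", CATALOG_ES))
print(translate("member-delete-button", {}))
```

With an empty catalog the explicit id still resolves to ``Delete'' through the partial English catalog, so the user never sees the raw id.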


Barry Warsaw 2003-04-08