JIT Localization

2003.03.21, M. Davis

In a world of on-demand software components, with arbitrary connections between those components, it is important to get a sense of where localization should be done, and how to transmit enough information so that it can be done at that appropriate place. End-users need to get messages localized to their languages, messages that not only contain a translation of text, but also contain variables such as date, time, number formats, and currencies formatted according to the users' conventions. The strategy for doing JIT localization is made up of two parts:

Store and transmit neutral-format data wherever possible.
- Neutral-format data is data that is kept in a standard format, no matter what the local user's environment is. It is also (loosely) called binary data, even though it actually could be represented in many different ways, including a textual representation such as in XML.
- Such data should use accepted standards where possible, such as for currency codes (see below).
- Textual data should also be in a uniform character set (Unicode/10646) to avoid possible data corruption problems when converting between encodings.
Localize that data as "close" to the end-user as possible.

There are a number of advantages to this strategy. The longer the data is kept in a neutral format, the more flexible the entire system is. On a practical level, if transmitted data is neutral-format, then it is much easier to manipulate the data, debug the processing of the data, and maintain the software connections between components.

Once data has been localized into a given language, it can be quite difficult to programmatically convert that data into another format, if required. This is especially true if the data contains a mixture of translated text and formatted variables. Once information has been localized into, say, Romanian, it is much more difficult to localize that data into, say, French. Parsing is more difficult than formatting, and may run up against different ambiguities in interpreting text that has been localized, even if the original translated message text is available (which it may not be).

Moreover, the closer we are to end- user, the more we know about that user's preferred formats. If we format dates, for example, at the user's machine, then it can easily take into account any customizations that the user has specified. If the formatting is done elsewhere, either we have to transmit whatever user customizations are in play, or we only transmit the user's locale code, which may only approximate the desired format. Thus the closer the localization is to the end user, the less we need to ship all of the user's preferences arond to all the places that localization could possibly need to be done.

Even though localization should be done as close to the end-user as possible, there will be cases where different components need to be aware of whatever settings are appropriate for doing the localization. Thus information such as a locale code or timezone needs to be communicated between different components. There are a number of connected issues discussed below.

Locale Codes

In most cases, the notion of a locale is fairly well accepted, as well as the kind of information associated with a given locale code. However, there are a number of outstanding problems with the current mechanisms naming locales. For more information, see language_code_issues.html.

In addition, the exact results produced from formatting or parsing text may vary according to the locale data available on the given machine. That is one reason why the Common Locale Repository project has been constituted.

Message Formatting and Exceptions

Windows (FormatMessage, String.Format), Java (MessageFormat) and ICU (MessageFormat, umsg) all provide methods of formatting variables (dates, times, etc) and inserting them at arbitrary positions in a string. This avoids the manual string concatenation that causes severe problems for localization. The question is, where to do this? It is especially important since the original code site that originates a particular message may be far down in the bowels of a component, and passed up to the top of the component with an exception. So we will take that case as representative of this class of issues.

There are circumstances where the message can be communicated with a language-neutral code, such as a numeric error code or mnemonic string key, that is understood outside of the component. If there are arguments that need to accompany that message, such as a number of files or a datetime, those need to accompany the numeric code so that when the localization is finally at some point, the full information can be presented to the end-user. This is the best case for localization.

More often, the exact messages that could originate from within the component are not known outside of the component itself; or at least they may not be known by the component that is finally displaying text to the user. In such a case, the information as to the user's locale needs to be communicated in some way to the component that is doing the localization. That locale information does not necessarily need to be communicated deep within the component; ideally, any exceptions should bundle up some language-neutral message ID, plus the arguments needed to format the message (e.g. datetime), but not do the localization at the throw site. This approach has the advantages noted above for JIT localization.

In addition, exceptions are often caught at a higher level; they don't end up being displayed to any end-user at all. By avoiding the localization at the throw site, it the cost of doing formatting, when that formatting is not really necessary. In fact, in many running programs most of the exceptions that are thrown at a low level never end up being presented to an end-user, so this can have considerable performance benefits.

Currencies

There is some confusion over the use of currencies that should be addressed. A currency amount logically consists of a numeric value, plus an accompanying ISO 4217 currency code (or equivalent). The currency code may be implicit in a protocol, such as where USD is implicit. But if the raw numeric value is transmitted without any context, then it has no definitive interpretation.

Notice that the currency code is completely independent of the end-user's language or locale. For example, RUR is the code for Russian Rubles. A currency amount of <RUR, 1.23457�10�> would be localized for a Russian user into "1 234,57р." (using U+0440 (р) cyrillic small letter er). For an English user it would be localized into the string "Rub 1,234.57" The end-user's language is needed for doing this last localization step; but that language is completely orthogonal to the currency code needed in the data. After all the same English user could be working with dozens of currencies. (See for example: NSDSA Currency Table and UNECE Currency Data.) Notice also that the currency code is also independent of whether currency values are inter-converted, which requires more interesting financial processing: the rate of conversion may depend on a variety of factors.

Thus logically speaking, once a currency amount is entered into a system, it should be logically accompanied by a currency code in all processing. This currency code is independent of whatever the user's original locale was. IOnly in badly-designed software is the currency code (or equivalent) not present, so that the software has to "guess" at the currency code based on the user's locale.

Time Zones

A similar issue arises with time. Transmitting "14:30" with no other context is incomplete unless it contains information about the time zone. Ideally one would transmit neutral-format date/time information, commonly in UTC, and localize as close to the user as possible. (For more about UTC, see NIST Time and Frequency Division Home Page, U.S. Naval Observatory: What is Universal Time?.)

The conversion from local time into UTC depends on the particular time zone rules, which will vary by location. The standard data used for converting local time (sometimes called wall time) to UTC and back is the Olson data, used by UNIX, Java, ICU, and others. The data includes rules for matching the laws for time changes in different countries. For example, for the US it is:

"During the period commencing at 2 o'clock antemeridian on the first Sunday of April of each year and ending at 2 o'clock antemeridian on the last Sunday of October of each year, the standard time of each zone established by sections 261 to 264 of this title, as modified by section 265 of this title, shall be advanced one hour..." (United States Law - 15 U.S.C. �6(IX)(260-7)).

Each region that has a different timezone or daylight savings time rules, either now or at any time in the past, is given a unique internal ID, such as Europe/Paris. As with currency codes, these are internal codes that should be localized if exposed to a user (such as in the Windows Control Panels>Date/Time>Time Zone).

Unfortunately, laws change over time, and will continue to change in the future, both for the boundaries of timezone regions and the rules for daylight savings. Thus the Olson data is continually being augmented. Any two implementations using the same version of the Olsen data will get the same results for the same IDs (assuming a correct implementation). However, if implementations use different versions of the data they may get different results. So if precise results are required then both the Olson ID and the Olson data version must be transmitted between the different implementations.

There are, however, circumstances where extra information needs to be stored along with the UTC data, especially with recurring events. For example, suppose that I specify that a delivery is to be made each day at 14:30 PT (Pacific Time). One cannot simply store that the delivery is to be made at 22:30 UTC, since that would be 14:30 PT during the winter and 15:30 PT during the summer. Recurring times are subject to a number of other complications. For example, suppose I specified 14:30 PT each business day; one would need to know what constitutes a "business day", which may depend on federal, state, or local laws. Other complications arise when one considers events at the same, such as deliveries at 14:30 in multiple states.