The Language Code Issue

"A shprakh iz a diyalekt mit an armey un a flot"
-- Max Weinreich (Joshua Fishman), 1945.

The current situation with regard to language codes and locale codes is a bit of a mess. We are looking at migrating ICU to use RFC 3066 instead of ISO 639 alone, and have found a number of problems. Our goal is to identify the language codes and region codes used in practice in Windows and other platforms, and account for any missing ones so that people can communicate locale information correctly, without loss of data because of missing or mismatched language codes. The issues that we turned up, or that were pointed out to us by others, are of interest not only for ICU but also to a broader audience.

In the body of this document, we will refer to a number of different standards or other resources. Here are links to them:

Missing Languages

The usual basis for a locale code is a combination of language code and region code, sometimes augmented with variants. The language code could be drawn either from ISO 639 or from RFC 3066, but there are problems with both. RFC 3066 codes are actually a superset of ISO 639 codes, with one of the following formats:

The possible formats are a bit broader than this, but these are the only important ones in practice.

As far as we are concerned, as a completely practical matter, two languages are different if they require substantially different localized resources. Distinctions according to spoken form are important in some contexts, but the written form is by far the most important issue for data interchange. Unfortunately, this is not the principle used in ISO 639, which has the fairly unproductive notion that only spoken language matters (and it is not completely consistent about this either). If the use of languages happens to correspond to region boundaries expressed as ISO 3166 country codes, then we can use RFC 3066 to express the difference, but in many cases this is not true. For example, both simplified and traditional Chinese are used in Hong Kong S.A.R.; both Cyrillic and Latin are used in Serbia, Azerbaijan, and Uzbekistan; Indic languages are customarily written in different scripts, etc.

ISO 15924 contains script codes that could be used, in some cases, to distinguish future language codes in these cases. Unfortunately, RFC 3066 does not permit the productive use of script codes for these cases; each example has to be separately registered, which can take quite a while. (There is good reason for this: ISO 15924 was not ready in time for RFC 3066 to refer to it!)

ISO 15924 does not yet help with simplified Chinese vs. traditional Chinese. ISO 15924 does allow for variant scripts, such as Latf for the Fraktur variant of Latin, or Latg for the Gaelic variant. We would need variant codes for Chinese (Hani) corresponding to simplified and traditional (such as Hans and Hant) if script codes were used to address the above problem. Thus we would end up with language codes like zh-Hans or az-Latn.

Windows provides codes for languages that are not in RFC 3066 in Culture Info, which is an update to the Windows Developing International Software, 2nd Edition. That book claims to use RFC 1766 codes (the predecessor of RFC 3066) for the Culture Names (i.e. locale codes). However, it has codes like az-AZ-Cyrl that do not follow the RFC and are not in the current registrations.

Best Format

We strongly recommend codes of the form uz-Latn-UZ or zh-Hans-HK, with the script in the middle.

The use of language codes of the form La-uz-UZ cited above is not optimal. The bare "La" definitely violates RFC 3066: it cannot even be registered, since it uses two letters, and it relies on case distinctions to separate scripts from countries, which is fragile. The four-letter form for the script would be consistent with RFC 3066, and would not be confused with a country code in a case-insensitive environment.
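To illustrate why the four-letter script form survives a case-insensitive environment, here is a minimal sketch that classifies subtags by length alone. The function name `parse_tag` and the returned dictionary layout are assumptions made for this example; they do not come from ICU or from RFC 3066.

```python
def parse_tag(tag):
    """Split a tag such as 'uz-Latn-UZ' into language, script, and region.

    Assumption for this sketch: the first subtag is the language, a
    4-letter subtag is a script, and a 2-letter subtag is a region.
    Classification uses length only, never letter case.
    """
    subtags = tag.split("-")
    result = {"language": subtags[0].lower(), "script": None, "region": None}
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha() and result["script"] is None:
            result["script"] = subtag.title()   # conventional casing: Latn
        elif len(subtag) == 2 and subtag.isalpha() and result["region"] is None:
            result["region"] = subtag.upper()   # conventional casing: UZ
        else:
            raise ValueError(f"unrecognized subtag {subtag!r} in {tag!r}")
    return result


# Case is irrelevant to the classification:
print(parse_tag("uz-Latn-UZ"))
print(parse_tag("UZ-LATN-uz"))
```

Both calls yield the same components, which a two-letter script code such as "La" could not guarantee once case is folded.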

But even if the two-letter form is changed to four, the form Latn-uz-UZ is not particularly good. At first glance, it might seem that script is a "greater" difference than language, and should be at the top level. But in practice, the differences between languages are greater than those between scripts, to human users. To learn how to read, at least haltingly, a known language in an unknown script, is a matter of a few days or weeks. To learn how to read, at all, an unknown language in a known script, is a matter of many months or years.

The ordering of script after base language is also better for resource inheritance chains used in systems like Java or ICU. In those chains, one strips off trailing subtags to find the resource parent. Any missing resource, such as a string, is inherited from the parent. The resource chain for some samples is shown in (I).

**I. Top Script**
Latn	Cyrl
Latn-az	Latn-uz	Cyrl-az	Cyrl-uz
Latn-az-AZ	Latn-uz-UZ	Cyrl-az-AZ	Cyrl-uz-UZ

But an isolated Cyrl or Latn at the top is just not very productive. Inheritance with this type of structure does not work in practice, because no resources would be shared at the top level between Latn-uz-UZ and Latn-az-AZ. Missing strings would fall back to a completely different language, and in practice an Uzbek speaker is far less likely to understand information in a different language than information in a different script.

The use of the form uz-UZ-Latn is also not optimal, as we see when we look at resource inheritance chains. Suppose we have resources for both Simplified and Traditional Chinese in Hong Kong, Simplified in China and Traditional in Taiwan:

The key difference is in the relationship between rows 2 and 3. In (II) below, the step from row 2 to row 3 can vary by script; in (III), it varies only by region.

**II. Bottom Script**
zh
zh-CN	zh-HK	zh-TW
zh-CN-Hans	zh-HK-Hans	zh-HK-Hant	zh-TW-Hant

**III. Middle Script**
zh
zh-Hans	zh-Hant
zh-Hans-CN	zh-Hans-HK	zh-Hant-HK	zh-Hant-TW

The typical differences between regions tend to be rather small, thus allowing inheritance, whereas the differences between scripts are huge. So if a string is missing at level 3 in (III), at least it gets the right script when inheriting from the parent. And typically differences from the parent are small, allowing most strings to be shared. In case (II), the resources in zh-HK have to be in one script or the other, so the child with the other script cannot effectively inherit any strings. And if a string is missing at level 3, it gets the wrong script when inheriting from the parent. These differences get even more pronounced when there are many different countries sharing the same script variant.
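The difference between (II) and (III) can be sketched with a small lookup that strips trailing subtags to find each parent. The helper names `fallback_chain` and `lookup`, and the two tiny resource tables, are hypothetical and exist only for this illustration; they are not ICU or Java APIs.

```python
def fallback_chain(tag):
    """Return the tag followed by each parent, stripping trailing subtags."""
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def lookup(resources, tag, key):
    """Return the first value for key found along the tag's fallback chain."""
    for candidate in fallback_chain(tag):
        bundle = resources.get(candidate, {})
        if key in bundle:
            return bundle[key]
    return None


# (III) Middle script: the parent of zh-Hant-HK is already in the right script.
MIDDLE = {
    "zh-Hans": {"language": "语言"},   # Simplified
    "zh-Hant": {"language": "語言"},   # Traditional
    "zh-Hant-HK": {},                  # string missing at level 3
}

# (II) Bottom script: zh-HK must pick one script (Traditional is assumed here).
BOTTOM = {
    "zh-HK": {"language": "語言"},     # Traditional
    "zh-HK-Hans": {},                  # string missing at level 3
}

print(fallback_chain("zh-Hant-HK"))               # ['zh-Hant-HK', 'zh-Hant', 'zh']
print(lookup(MIDDLE, "zh-Hant-HK", "language"))   # Traditional, the right script
print(lookup(BOTTOM, "zh-HK-Hans", "language"))   # Traditional, the wrong script
```

In the middle-script layout the missing string is filled in from a parent in the same script; in the bottom-script layout the Simplified child silently receives Traditional text.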

This also fits with the pattern used for yi-latn in RFC 3066 -- Language code assignments. Interestingly, there have not yet been any 3066 assignments of the form <639_code>-<four-letter-code>, so there would not be any possibility of collisions (if we move quickly).

In a sense, one can think of the forms uz-Latn-UZ or zh-Hans-HK, with the script in the middle, as being "big-endian": supplying the most significant information first, and following on with progressive refinements. This is in line with the current practice in RFC 3066 of language then region. Not coincidentally, this also works the best with resource inheritance.

Matching is the other consideration for the ordering of the tags. Harald mentions the following goals:

The codes az-Latn and az-Cyrl are essentially variants of an ISO 639 language code, and could themselves have country variants. Thus in practice we would need to register each code combined with whatever regions made sense, and end up with: az-Latn, az-Latn-AZ, az-Cyrl, az-Cyrl-AZ,... in addition to the current az, az-AZ. Thus all of the matching options would be available.

It is important to point out that these are two very different uses of language tags: matching vs. accessing. For matching you typically want an "unmarked" case, thus zh matching any Chinese text, no matter how it is written, or zh-Hani matching any Chinese text written with CJK ideographs, whether they be Hant or Hans (or some other as-yet-undistinguished variant). Similarly, "en" exists, and in matching, would match any of en-US, en-CA, etc. For this first case, matching, there is no notion of "default" text.

For accessing, it is another matter. If I ask for text in "en", I'm going to get one particular variant, because there is no alternative. That is the place where you end up having a default value. Presumably, if you really care what you get, you will ask a more detailed question, such as for text in "en-US", or ask for a series of alternatives with weights or a preference ordering, such as: {en-US, en-CA, en, fr-CH, fr}. (The fr-CH for sensible numbers, instead of the equivalent of four-score and nineteen for 99. ☺)
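The two uses can be contrasted in a short sketch. The function names `matches` and `access`, and the sample table of available text, are assumptions introduced for illustration only; the matching shown is a plain prefix comparison on subtags and does not attempt the script-family matching described for zh-Hani.

```python
def matches(pattern, tag):
    """Matching: True if every subtag of pattern is a leading subtag of tag.

    'en' matches 'en', 'en-US', and 'en-CA'; there is no default involved.
    """
    p = pattern.lower().split("-")
    t = tag.lower().split("-")
    return t[:len(p)] == p


def access(available, preferences):
    """Accessing: return (tag, text) for the first preferred tag that exists."""
    for wanted in preferences:
        if wanted in available:
            return wanted, available[wanted]
    return None


# Hypothetical store: whatever sits under plain "en" is the de facto default.
AVAILABLE = {"en": "color", "fr": "couleur"}

print(matches("en", "en-CA"))      # True
print(matches("en-US", "en-CA"))   # False
print(access(AVAILABLE, ["en-US", "en-CA", "en", "fr-CH", "fr"]))
```

Here the preference list falls through to "en", so the caller receives whichever variant happens to be stored there, which is exactly where a default value enters.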

Plan of Action

In any event, the important step is to get the missing written language codes registered. Ideally, there would simply be ISO 639 codes for the missing written languages, but failing that, the first steps would be:

It may take some time to do #3, so it is important to do #2 first to meet the immediate requirements.

Note: In practice, as Peter Edberg remarks, one ends up having to have a default for zh found alone (e.g. simplified), when used in a resource lookup. However, when used as a matching string (see above), one would not make the assumption. If we didn't care about matching, initially, we might be able to get by with just the following 12 combinations in #2. (Or perhaps fewer, if we limit the Chinese cases to those in actual use in systems, e.g. HK.)

We do need Step #3 eventually, for matching purposes, and because the set of ways in which languages can be used with different scripts is not closed. Roozbeh points out, for example, that az-Arab and uz-Arab are both used. But there was a reason that RFC 3066 allowed productive use of Language + Region codes: it saves a good deal of time and effort not to have to register all the combinations that anyone wants. Many different languages can be written in multiple scripts, and if one counts all the regions that they could be used in, it would be far better to make this a generative mechanism, as it is with ISO Language + Region codes.

Resolving Ambiguities

ICU already supports RFC 3066 codes of forms (1)-(3) in Current RFC 3066 Formats; the types of codes not yet supported are of the form:

One reason for the hesitation in supporting (4) was that there is not a clean line between language codes and locale codes, nor a clean line between the sorts of data associated with either one. (See Language vs. Locale Codes.) Here is our recommendation for supporting (4) by disambiguating problematic cases. Take the following two examples:

To match the most likely usage, we propose interpreting these as (1a) and (2b). That is, if we have x-y-z-... (where trailing fields might be missing), we check y and z to see which are valid 2 or 3 letter region codes. If y and z are both valid region codes, we use (1a), otherwise we use (1b).
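The check on y and z can be sketched as follows. This is only an illustration under stated assumptions: the region set is a tiny hypothetical sample rather than the full ISO 3166 list, the name `interpret` is invented here, and the "otherwise" branch is assumed to mean language x, region y, variant z. When both y and z are valid regions, the sketch treats x-y as the language and z as the region, as in the zh-TW-HK case in this section. AL is deliberately left out of the sample set to mirror the special-casing of sv_FI_AL described later in this section.

```python
# Hypothetical sample; a real implementation would consult the full ISO 3166 list.
SAMPLE_REGIONS = {"TW", "HK", "CN", "FI", "US"}


def interpret(tag, regions=SAMPLE_REGIONS):
    """Disambiguate x-y-z by checking whether y and z are valid region codes."""
    parts = tag.replace("_", "-").split("-")
    x = parts[0].lower()
    y = parts[1].upper() if len(parts) > 1 else None
    z = parts[2].upper() if len(parts) > 2 else None
    if y in regions and z in regions:
        # Both are regions: x-y is the language, z is the region.
        return {"language": f"{x}-{y}", "region": z, "variant": None}
    # Assumed fallback reading: plain language, then region, then variant.
    return {"language": x, "region": y, "variant": z}


print(interpret("zh-TW-HK"))
print(interpret("sv_FI_AL"))
```

The first call yields language zh-TW with region HK; the second yields language sv, region FI, and variant AL.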

As far as resource data goes, we would use the following typical hierarchy for such complex cases. The inheritance goes from the bottom up.

zh (lang: zh; no region; no currency; other data: Simplified Chinese)
zh-TW (lang: zh-TW; region: TW; currency: TWD; other data: Traditional Chinese)	zh-CN (lang: zh; region: CN; currency: CNY; other data: inherited)	zh-HK (lang: zh; region: HK; currency: HKD; other data: inherited)
zh-TW-HK (lang: zh-TW; region: HK; currency: HKD; other data: inherited)

When resolving locale names, we would map x-y-y to x-y, so zh-TW-TW would map to zh-TW. We can also alias a whole resource bundle (such as for zh-CN-HK) to another resource bundle (such as zh-HK). It would be good, however, to have a canonical mapping to shortest language names, so that such explicit aliasing would not be necessary, and the locales could be transformed to a canonical form before lookup. This would also be needed in dealing with cases like az-AZ, az-Latn-AZ, and az-Cyrl-AZ, and also for cases where either the two-letter or three-letter ISO 639 or 3166 codes are supplied.
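A minimal sketch of the name resolution just described, assuming a hypothetical alias table that holds only the zh-CN-HK example from the text. The function name `canonicalize` is invented for this illustration, and the treatment of "_" as equivalent to "-" follows the pragmatic position taken under Language vs. Locale Codes.

```python
# Hypothetical alias table containing only the example from the text.
ALIASES = {"zh-CN-HK": "zh-HK"}


def canonicalize(tag, aliases=ALIASES):
    """Map x-y-y to x-y, then apply any whole-bundle alias."""
    parts = tag.replace("_", "-").split("-")
    # Collapse a repeated region: zh-TW-TW becomes zh-TW.
    if len(parts) == 3 and parts[1].upper() == parts[2].upper():
        parts = parts[:2]
    candidate = "-".join(parts)
    return aliases.get(candidate, candidate)


print(canonicalize("zh-TW-TW"))   # zh-TW
print(canonicalize("zh-CN-HK"))   # zh-HK
print(canonicalize("zh_TW"))      # zh-TW
```

Running the transformation before lookup means the resource store only needs bundles under canonical names.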

Of course, the situation would be far easier if there were separate ISO 639 codes or RFC 3066 codes for simplified Chinese and traditional Chinese, as discussed above. We would then have zh-Hant-HK instead of zh-TW-HK, etc., with fewer ambiguities. We would still need to have some kind of lookup to get a unique form (zh-Hant-TW => zh-TW), as discussed by Peter Edberg, but that would not be particularly onerous.

In ICU, we have two variant codes with two letters: AL and NY. In both cases, we make longer names for these aliases, and will grandfather in locale data that has these two codes. (NY is not an issue; it is present for backwards compatibility, and does not match an existing region. AL is used in sv_FI_AL; but it is extremely unlikely that anyone would need "Swedish as spoken in Finland, used in Albania", so there is little loss in special casing this one.)

Language vs. Locale Codes

Part of the fuzziness around this whole topic is that people have very slippery notions of what distinguishes a language code vs. a locale code. The problem is that both are somewhat nebulous concepts.

In practice, many people use RFC 3066 codes to mean locale codes instead of strictly language codes. It is easy to see why this came about: because RFC 3066 includes an explicit region (country) code, for most people it was sufficient for use as a locale code as well. For example, when typical web software receives an RFC 3066 code, it uses it as a locale code. Other typical software will do the same: in practice, language codes and locale codes are treated interchangeably. Some people recommend distinguishing on the basis of "-" vs "_" (e.g. zh-TW for language code, zh_TW for locale code), but in practice that does not work because of the free variation out in the world in the use of these separators. Notice that Windows, for example, uses "-" as a separator in its locale codes. So pragmatically we are forced to treat "-" and "_" as equivalent.

Another reason for the conflation of these codes is that very little data in most systems is distinguished by region alone; currency codes and measurement systems being some of the few. Sometimes date or number formats are mentioned as regional, but that really doesn't make much sense. If I saw the sentence "You will have to adjust the value to १,२३४.५६७ from ૭૧,૨૩૪.૫૬" (using Indic digits), I would say that sentence is simply not English. Number format is far more closely associated with language than it is with region. The same is true for date formats: I would never expect to see intermixed a date in the format "2003年4月1日" (using Kanji) in text purporting to be purely English. There are regional differences in date and number format (differences which can be important), but those are different in kind than other language differences between regions.

Notice also that currency codes are different than currency localizations. The currency localizations are typically in the language-based resource bundles, not in the region-based resource bundles. Thus, the resource bundle en_US will contain the currency code USD; the resource bundle en will contain no currency code, but will contain the localized mappings in English for a range of different currency codes: USD => $, RUR => Rub, etc. (In protocols, the currency codes should always accompany any currency amounts; otherwise the data is ambiguous, and software is forced to use the user's locale to guess at the currency. But that is a subject for a different document, JIT Localization.)
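The split between currency codes and currency localizations might look like the sketch below. The bundle layout, the key names `currency_code` and `currency_symbols`, and the functions `find` and `format_amount` are all assumptions made for illustration; real ICU resource bundles are structured differently. Placing the symbol directly before the amount is also a simplification.

```python
# Hypothetical bundles: the code lives with the region, the symbols with the language.
BUNDLES = {
    "en": {"currency_symbols": {"USD": "$", "RUR": "Rub"}},
    "en_US": {"currency_code": "USD"},
}


def find(locale, key, bundles=BUNDLES):
    """Look up key in the locale's bundle, then in each parent bundle."""
    parts = locale.split("_")
    while parts:
        bundle = bundles.get("_".join(parts), {})
        if key in bundle:
            return bundle[key]
        parts.pop()   # strip the trailing subtag to reach the parent
    return None


def format_amount(locale, amount, code=None):
    """Format amount with an explicit currency code, else the locale's own."""
    code = code or find(locale, "currency_code")
    if code is None:
        raise ValueError(f"no currency code available for {locale!r}")
    symbols = find(locale, "currency_symbols") or {}
    return f"{symbols.get(code, code)}{amount:.2f}"


print(format_amount("en_US", 12.5))          # $12.50
print(format_amount("en_US", 12.5, "RUR"))   # Rub12.50
```

Asking the bare en locale to format an amount without an explicit code raises an error, which reflects the point that a currency amount without its code is ambiguous.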

Written Language

Our criteria for what makes a written language are purely pragmatic: what would a copy-editor say? If I gave him/her text like:

S/he would say: Sorry, that is *far* from acceptable English for publication, do it again! So I would change it to either B or C below, depending on which orthographic variant of English was the target for the publication:

Clearly there are many acceptable variations on this text. For example, s/he might still quibble with the use of first vs. last name sorting in the list, but clearly the first list was not in acceptable English alphabetical order. And in quoting a name, like "Theatre Centre News", you may leave it in the source orthography even if it differs from the publication target orthography. And so on.

However, just as clearly, there are limits on what is acceptable English, and 2003年3月20日, for example, is not.