- Overview of Software Internationalization
- ICU Services Overview
- Internationalization and Unicode
- Project Management Tips for Internationalizing Software
Developing globalized software is a continuous balancing act, because software developers and project managers often underestimate the level of effort and detail required to create foreign-language software releases.
Software developers must understand the ICU services to design and deploy successful software releases. The services can save ICU users time in dealing with the kinds of problems that typically arise during critical stages of the software life cycle.
In general, the standard process for creating globalized software includes “internationalization”, which covers generic coding and design issues, and “localization”, which involves translating and customizing a product for a specific market.
Software developers must understand the intricacies of internationalization since they write the actual underlying code. How well they use established services to achieve mission objectives determines the overall success of the project. At a fundamental level, code and feature design affect how a product is translated and customized. Therefore, software developers need to understand key localization concepts.
From a geographic perspective, a locale is a place. From a software perspective, a locale is an ID used to select information associated with a language and/or a place. ICU locale information includes the name and identifier of the spoken language, sorting and collating requirements, currency usage, numeric display preferences, and text direction (left-to-right or right-to-left, horizontal or vertical).
General locale-sensitive standards include keyboard layouts, default paper and envelope sizes, common printers and monitor resolutions, character sets or encoding ranges, and input methods.
The ICU services support all major locales with language and sub-language pairs. The sub-language generally corresponds to a country. One way to think of this is in terms of the phrase “X language as spoken in Y country.” The way people speak or write a particular language might not change dramatically from one country to the next (for example, German is spoken in Austria, Germany, and Switzerland). However, cultural conventions and national standards often differ a great deal.
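The "X language as spoken in Y country" idea can be sketched in a few lines. This is only an illustration: it uses the JDK's `java.util.Locale` as a stand-in for an ICU locale ID, and the class name `LocaleIdDemo` and the helper `describe` are hypothetical.

```java
import java.util.Locale;

class LocaleIdDemo {
    // Describe a locale ID as "X language as spoken in Y country", in English.
    static String describe(String languageTag) {
        Locale locale = Locale.forLanguageTag(languageTag);
        return locale.getDisplayLanguage(Locale.ENGLISH)
                + " as spoken in " + locale.getDisplayCountry(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // Same language, three countries: the ID differs only in the country part.
        for (String tag : new String[] {"de-AT", "de-DE", "de-CH"}) {
            System.out.println(tag + ": " + describe(tag));
        }
    }
}
```

The language part stays the same across the three IDs, while the country part is what selects the differing conventions.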
A key advantage of using the ICU services is reduced time to market. Display strings are bundled in separate text files for translation, so a team of programmers and translators no longer needs to search the source code in order to rewrite the software for each country and language.
Unicode enables a program to use a standard encoding scheme for all textual data within the program’s environment. Conversion has to be done with incoming and outgoing data only. Operations on the text (while it is in the environment) are simplified since you do not have to keep track of the encoding of a particular text.
Unicode supports multilingual data since it encodes characters for all world languages. You do not have to tag pieces of data with their encoding to enable the right characters, and you can mix languages within a single piece of text.
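A minimal sketch of converting only at the boundaries, assuming the incoming data is ISO-8859-1 and the outgoing data should be UTF-8 (both encodings are chosen purely for illustration). It uses the JDK's built-in charsets rather than ICU's converters, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class EncodingBoundaryDemo {
    // Incoming data: decode bytes from their external encoding into Unicode text.
    static String decodeLatin1(byte[] incoming) {
        return new String(incoming, StandardCharsets.ISO_8859_1);
    }

    // Outgoing data: encode Unicode text into the encoding the consumer expects.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" as ISO-8859-1 bytes.
        byte[] incoming = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};
        String text = decodeLatin1(incoming);
        // Inside the program, languages can be mixed without tagging encodings.
        String mixed = text + " \u03ba\u03b1\u03c6\u03ad"; // Greek "καφέ"
        System.out.println(mixed.length() + " UTF-16 code units, "
                + encodeUtf8(mixed).length + " UTF-8 bytes");
    }
}
```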
Some of the advantages of using ICU to internationalize your program include the following:
- It can handle text in any language or combination of languages.
- The source code can be written so that the program can work for many locales.
- Configurable, pluggable localization is enabled.
- Multiple locales are supported at the same time.
- Non-technical people can be given access to information and you don’t have to open the source code to them.
- Software can be developed so that the same code can be ported to various platforms.
The following two practices are key when managing, developing, and designing a successful internationalized software deliverable:
- Separate the program’s executable code from its UI elements.
- Avoid making cultural assumptions.
Keep static information (such as pictures, window layouts) separate from the program code. Also ensure that the text which the program generates on the fly (such as numbers and dates) comes out in the right language. The text must be formatted correctly for the targeted user community.
Make sure the analysis and manipulation of both text and other kinds of data (such as dates) are done in a manner that can be easily adapted for different languages and user communities. This includes tasks such as alphabetizing lists and looking for line-break positions.
Characters must display on the screen correctly (the text’s storage format must be translated to the proper visual images). They must also be accepted as input (translated from keystrokes, voice input or another kind of input into the text’s storage format). These processes are relatively easy for English, but quite challenging for other languages.
Good software design requires that the programming code implementing the user interface (UI) be kept separate from code implementing the underlying functionality. The description of the UI must also be kept separate from the code implementing it.
The description of the UI contains items that the user sees, including the various messages, buttons, and menu commands. It also contains information about how dialog boxes are to be laid out, and how icons, colors or other visual elements are to be used. Layouts must allow for translated text that is longer or shorter than the original. For example, German words tend to be longer since they contain grammatical suffixes that English has lost in the last 800 years. The following table shows how word lengths can differ among languages.
The description of the UI, especially user-visible pieces of text, must be kept together and not embedded in the program’s executable code. ICU provides the ResourceBundle services for this purpose.
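As a rough sketch of keeping user-visible text out of the executable code, the example below loads UI strings from property-style text. It uses the JDK's `PropertyResourceBundle` as a stand-in for ICU's ResourceBundle services. The in-memory `FILES` map (standing in for per-language text files), the keys, and the class name are all hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class UiStringsDemo {
    // Stand-ins for per-language text files that translators would edit.
    static final Map<String, String> FILES = new HashMap<>();
    static {
        FILES.put("en", "open=Open\nquit=Quit\n");
        FILES.put("de", "open=\u00d6ffnen\nquit=Beenden\n");
    }

    // Pick the text for the locale's language, falling back to English.
    static ResourceBundle bundleFor(Locale locale) {
        String text = FILES.getOrDefault(locale.getLanguage(), FILES.get("en"));
        try {
            return new PropertyResourceBundle(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The program code refers to keys only; the visible text lives elsewhere.
        System.out.println(bundleFor(Locale.GERMANY).getString("quit"));
        System.out.println(bundleFor(Locale.US).getString("quit"));
    }
}
```

Adding a language then means adding one more text resource, with no change to the code that looks up `quit` or `open`.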
Another difficulty encountered when designing and implementing code is making it flexible enough to handle different ways of doing things in other countries and cultures. Most programmers make unconscious assumptions about their users’ language and customs when they design their programs. For example, in Thailand, the official calendar is the Buddhist calendar and not the Gregorian calendar.
These assumptions make it difficult to translate the user interface portion of the code for some user communities without rewriting the underlying program. The ICU libraries provide flexible APIs that can be used to perform the most common and important tasks. They contain pre-built supporting data that enables them to work correctly in 75 languages and more than 200 locales. The key is understanding when, where, why, or how to use the APIs effectively.
The remainder of this section provides an overview of some common cultural and hidden assumptions. It covers the following topics:
- Numbers and Dates
- Measuring Units
- Alphabetical Order of Characters
- Text Input and Layout
- Text Manipulation
- Date/Time Formatting
- Distributed Locale Support
Numbers and dates are represented differently in different languages. Do not implement routines for converting numbers into strings, and do not call low-level system interfaces like sprintf() that do not produce language-sensitive results. Instead, see how ICU’s NumberFormat and DateFormat services can be used more effectively.
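The sketch below shows the general shape of locale-sensitive number and date formatting. It uses the JDK's `java.text.NumberFormat` and `java.text.DateFormat`, which share their names with the ICU services but are not the ICU implementation. The class and helper names are hypothetical.

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Date;
import java.util.Locale;

class NumberDateDemo {
    // Format a number using the grouping and decimal separators of the locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Format a date using the field order and month names of the locale.
    static String formatDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.LONG, locale).format(date);
    }

    public static void main(String[] args) {
        Date now = new Date();
        for (Locale locale : new Locale[] {Locale.US, Locale.GERMANY, Locale.FRANCE}) {
            System.out.println(locale + ": " + formatNumber(1234567.891, locale)
                    + " | " + formatDate(now, locale));
        }
    }
}
```

The caller never builds the string by hand; it only supplies the value and the locale.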
Be careful when formulating assumptions about how individual pieces of text are used together to create a complete sentence (for example, when error messages are generated). The elements might go together in a different order if the message is translated into a new language. ICU provides MessageFormat and ChoiceFormat to help with these occurrences.
Note: There also might be situations where parts of the sentence change when other parts of the sentence also change (selecting between singular and plural nouns that go after a number is the most common example).
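A small sketch of both points, element order and number-dependent wording, using the JDK's `java.text.MessageFormat` with an embedded choice pattern as a stand-in for the ICU services. The patterns, argument values, and class name are made up for illustration.

```java
import java.text.MessageFormat;
import java.util.Locale;

class MessageDemo {
    // The noun phrase changes with the number in argument 0.
    static final String FILE_COUNT =
            "The disk \"{1}\" contains {0,choice,0#no files|1#one file|1<{0,number,integer} files}.";

    // Two patterns that use the same arguments in a different order,
    // as a translation into another language might require.
    static final String ACTIVE = "{0} deleted {1}.";
    static final String PASSIVE = "{1} was deleted by {0}.";

    static String format(String pattern, Object... args) {
        return new MessageFormat(pattern, Locale.ENGLISH).format(args);
    }

    public static void main(String[] args) {
        for (int count : new int[] {0, 1, 5}) {
            System.out.println(format(FILE_COUNT, count, "D"));
        }
        System.out.println(format(ACTIVE, "Ann", "report.txt"));
        System.out.println(format(PASSIVE, "Ann", "report.txt"));
    }
}
```

Because the pattern is data, a translator can reorder `{0}` and `{1}` or change the plural wording without touching the calling code.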
Numerical representations can change with regard to measurement units and currency values. Currency values can vary by country. A good example of this is the representation of $1,000. This amount can represent either U.S. or Canadian dollar values. U.S. dollars can be displayed as USD while Canadian dollars can be displayed as CAD, depending on the locale. In this case, the displayed numerical quantity might change, and the number itself might also change. NumberFormat provides some support for this.
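To make the USD/CAD distinction concrete, here is a minimal sketch that asks for the currency associated with a locale and formats an amount with it. It relies on the JDK's `java.util.Currency` and `java.text.NumberFormat` rather than ICU itself, and the class and helper names are hypothetical.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

class CurrencyDemo {
    // ISO 4217 code of the currency used in the locale's country.
    static String currencyCode(Locale locale) {
        return Currency.getInstance(locale).getCurrencyCode();
    }

    // Amount formatted with the locale's currency symbol and separators.
    static String formatAmount(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    public static void main(String[] args) {
        for (Locale locale : new Locale[] {Locale.US, Locale.CANADA, Locale.GERMANY}) {
            System.out.println(locale + ": " + currencyCode(locale)
                    + " " + formatAmount(1000, locale));
        }
    }
}
```

Note that formatting alone does not convert between currencies; the amount passed in must already be in the currency being displayed.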
Not all languages (even those using the same alphabet) have the same concept of alphabetical order. Do not assume that alphabetical order is the same as the numerical order of the characters’ code-point values. In practice, ‘a’ is distinct from ‘A’ and ‘b’ is distinct from ‘B’. Each has a different code point. This means that you cannot use a bit-wise lexical comparison (such as what strcmp() provides) to sort user-visible lists.
Not all languages interpret the same characters as equivalent. If a character’s case is changed it is not always a one-to-one mapping. Accent differences, the presence or absence of certain characters, and even spelling differences might be insignificant when determining whether two strings are equal. The Collator services provide significant help in this area.
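The difference between binary order and language-sensitive order can be sketched as follows. The example uses the JDK's `java.text.Collator` as a stand-in for the ICU Collator services, and the word list and helper names are hypothetical.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class SortDemo {
    // Binary order: compares UTF-16 code units, much as strcmp() compares bytes.
    static String[] sortBinary(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Language-sensitive order for the given locale.
    static String[] sortForLocale(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    // At primary strength only base-letter differences count,
    // so case differences do not make two strings unequal.
    static boolean equalAtPrimaryStrength(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String[] words = {"banana", "Zebra", "apple"};
        System.out.println(Arrays.toString(sortBinary(words)));
        System.out.println(Arrays.toString(sortForLocale(Locale.ENGLISH, words)));
        System.out.println(equalAtPrimaryStrength("resume", "r\u00e9sum\u00e9", Locale.ENGLISH));
    }
}
```

In binary order every uppercase ASCII letter sorts before every lowercase one, which is why "Zebra" lands ahead of "apple" in the first list.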
A character does not necessarily correspond to a single code-point position in the backing store. Not all languages have the same definition of a word, and a group of characters separated by white space is not always an acceptable approximation of a word. ICU provides the BreakIterator services to help locate boundaries and count units of text.
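As an illustration of boundary analysis, the sketch below counts user-perceived characters and words with the JDK's `java.text.BreakIterator`, used here as a stand-in for the ICU BreakIterator services. The rule "a segment counts as a word if it starts with a letter or digit" is a simplifying assumption of this example, and the names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BoundaryDemo {
    // Count user-perceived characters; a base letter plus a combining
    // accent occupies two storage positions but is one character.
    static int countUserCharacters(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Count segments between word boundaries that start with a letter or digit,
    // skipping the segments that hold only spaces or punctuation.
    static int countWords(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String decomposed = "cafe\u0301"; // "café" with a combining acute accent
        System.out.println(decomposed.length() + " storage units, "
                + countUserCharacters(decomposed, Locale.FRENCH) + " characters");
        System.out.println(countWords("Hello, world!", Locale.ENGLISH) + " words");
    }
}
```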
When checking characters for membership in a particular class, do not list the specific characters you are interested in, and do not assume they come in any particular order in the encoding scheme. For example, /A-Za-z/ does not mean all letters in most European languages, and /0-9/ does not mean all digits in many writing systems. This also holds true when using C interfaces such as islower(). ICU provides a large group of utility functions for testing character properties.
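The following sketch contrasts a hard-coded ASCII range check with property-based tests. It uses the JDK's `java.lang.Character` methods purely as an illustration of the idea; they are not the ICU utility functions, and `isAsciiLetter` is a hypothetical helper written to show the pitfall.

```java
class CharacterClassDemo {
    // The pitfall: a hard-coded range that only covers unaccented Latin letters.
    static boolean isAsciiLetter(int codePoint) {
        return (codePoint >= 'A' && codePoint <= 'Z')
                || (codePoint >= 'a' && codePoint <= 'z');
    }

    public static void main(String[] args) {
        int eAcute = 0x00E9;      // LATIN SMALL LETTER E WITH ACUTE
        int arabicThree = 0x0663; // ARABIC-INDIC DIGIT THREE
        int sharpS = 0x00DF;      // LATIN SMALL LETTER SHARP S

        // The range check rejects a letter that the Unicode property accepts.
        System.out.println(isAsciiLetter(eAcute) + " vs " + Character.isLetter(eAcute));
        // Digits exist outside 0-9, and their numeric value is still available.
        System.out.println(Character.isDigit(arabicThree) + ", value "
                + Character.digit(arabicThree, 10));
        // Case is a property too, not a position in the code chart.
        System.out.println(Character.isLowerCase(sharpS));
    }
}
```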
Do not assume anything about how a piece of text might be drawn on the screen, including how much room it takes up, the direction it flows, or where on the screen it should start. All of these text elements vary according to language. Similarly, there might not be a one-to-one relationship between characters and keystrokes. One-to-many, many-to-one, and many-to-many relationships between characters and keystrokes all occur in real text in some languages.
Do not assume that all textual data that the program stores and manipulates is in any particular language or writing system. ICU provides many methods that help with text storage. The UnicodeString class and u_strxxx functions are provided for Unicode-based character manipulation. For example, characters can be appended to, removed from, or extracted out of an existing Unicode character buffer.
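The append, remove, and extract operations mentioned above might look like this in outline. The sketch uses the JDK's `StringBuilder` as a rough analogue of a Unicode string buffer; it is not ICU's UnicodeString, and the class and method names are hypothetical.

```java
class TextBufferDemo {
    // Build a buffer, then append to it and remove from it.
    static String edit() {
        StringBuilder buffer = new StringBuilder("Hello");
        buffer.append(", \u4e16\u754c");  // append Chinese text to English text
        buffer.appendCodePoint(0x1F30D);  // append one character stored as two UTF-16 units
        buffer.delete(0, 7);              // remove the leading "Hello, "
        return buffer.toString();
    }

    public static void main(String[] args) {
        String result = edit();
        // Extract the first two characters from what remains.
        String extracted = result.substring(0, 2);
        System.out.println(extracted + " | " + result.length() + " UTF-16 units, "
                + result.codePointCount(0, result.length()) + " code points");
    }
}
```

The gap between the unit count and the code-point count is a reminder that buffer offsets are storage positions, not characters.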
A good example of text manipulation is the Rosetta stone. The same text is written on it in Hieroglyphic, Greek and Demotic. ICU provides the services to process multilingual text such as this correctly.
Calendars differ in many ways, such as the lengths of months or years, which day is the first day of the week, and the allowable range of values for fields like month and year (with DateFormat). A program might also need to determine the time zone you are in (with TimeZone), or when daylight-saving time starts. ICU provides the Calendar services needed to handle these issues.
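A few of these differences can be queried directly, as in the sketch below. It uses the JDK's `java.util.Calendar`, `java.util.TimeZone`, and `java.time.chrono.ThaiBuddhistDate` as stand-ins for the ICU Calendar services. The chosen locales, zone IDs, and helper names are only examples.

```java
import java.time.LocalDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

class CalendarDemo {
    // First day of the week as a Calendar constant (Calendar.SUNDAY, Calendar.MONDAY, ...).
    static int firstDayOfWeek(Locale locale) {
        return Calendar.getInstance(locale).getFirstDayOfWeek();
    }

    // Year number of the same day in the Thai Buddhist calendar.
    static int thaiBuddhistYear(LocalDate isoDate) {
        return ThaiBuddhistDate.from(isoDate).get(ChronoField.YEAR);
    }

    // Whether the zone's rules include daylight-saving time.
    static boolean observesDaylightTime(String zoneId) {
        return TimeZone.getTimeZone(zoneId).useDaylightTime();
    }

    public static void main(String[] args) {
        System.out.println(firstDayOfWeek(Locale.US) == Calendar.SUNDAY);
        System.out.println(firstDayOfWeek(Locale.FRANCE) == Calendar.MONDAY);
        System.out.println(thaiBuddhistYear(LocalDate.of(2024, 1, 15)));
        System.out.println(observesDaylightTime("America/New_York"));
    }
}
```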
In most server applications, do not assume that all clients connected to the server interact with their users in the same language. Also do not assume that a session stops and restarts whenever a user speaking one language replaces another user speaking a different language. ICU provides sufficient flexibility for a program to handle multiple locales at the same time.
For example, a Web server needs to serve pages to different users, in different languages and with different date formats, at the same time.
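One way to picture this is a request handler that receives the client's locale with every call instead of relying on a process-wide default. The sketch below is a hypothetical handler built on the JDK's `java.text.NumberFormat`, not an ICU or server API; the language tags and the method name `renderTotal` are assumptions.

```java
import java.text.NumberFormat;
import java.util.Locale;

class MultiLocaleServerDemo {
    // Hypothetical request handler: the locale travels with each request,
    // so no global default has to be switched between users.
    static String renderTotal(String clientLanguageTag, double total) {
        Locale locale = Locale.forLanguageTag(clientLanguageTag);
        return NumberFormat.getNumberInstance(locale).format(total);
    }

    public static void main(String[] args) {
        // Two clients served by the same process, one after the other.
        System.out.println(renderTotal("en-US", 1234.5));
        System.out.println(renderTotal("de-DE", 1234.5));
    }
}
```

Creating the formatter inside the handler also avoids sharing one formatter object between concurrent requests.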
The ICU LayoutEngine is an Open Source library that provides a uniform, easy-to-use interface for preparing complex scripts or text for display. The Latin script, which is the most commonly used script among software developers, is also the least complex script to display, especially when it is used to write English. Using the Latin script, characters can be displayed from left to right in the order that they are stored in memory. Some scripts require rendering behavior that is more complicated than the Latin script. We refer to these scripts as “complex scripts” and to text written in these scripts as “complex text.”