UTF-8

Note: This page is only relevant for C/C++. In Java, all strings are encoded in UTF-16; other encodings appear only when converting bytes to strings (via InputStreamReader or similar) or strings to bytes (OutputStreamWriter etc.).

While most of ICU works with UTF-16 strings and uses data structures optimized for UTF-16, there are APIs that facilitate working with UTF-8, or are optimized for UTF-8, or work with Unicode code points (21-bit integer values) regardless of string encoding. Some data structures are designed to work equally well with UTF-16 and UTF-8.

For UTF-8 strings, ICU normally uses (const) char * pointers and int32_t lengths, with semantics parallel to UTF-16 handling: an input length of -1 means NUL-terminated, output is NUL-terminated if there is space, and output overflow is handled with preflighting. (For details see the parent Strings page.) Some newer APIs take an icu::StringPiece argument and write to an icu::ByteSink or to a string class object like std::string.

Conversion Between UTF-8 and UTF-16

The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++ icu::UnicodeString methods fromUTF8(const StringPiece &utf8) and toUTF8String(StringClass &result). There is also toUTF8(ByteSink &sink).

In C, unicode/ustring.h has functions like u_strFromUTF8WithSub() and u_strToUTF8WithSub(). (Also u_strFromUTF8(), u_strToUTF8() and u_strFromUTF8Lenient().)

The conversion functions in unicode/ucnv.h are intended for very flexible handling of conversion to/from external byte streams (with customizable error handling and support for split buffers at arbitrary boundaries) which is normally unnecessary for internal strings.

Note: icu::UnicodeString has constructors, setTo() and extract() methods which take either a converter object or a charset name. These can be used for UTF-8, but are not as efficient or convenient as the fromUTF8()/toUTF8()/toUTF8String() methods mentioned above. (Among conversion methods, APIs with a charset name are more convenient but internally open and close a converter; ones with a converter object parameter avoid this.)

UTF-8 as Default Charset

ICU has many functions that take or return char * strings that are assumed to be in the default charset, which should match the system encoding. Since this could be one of many charsets, and the charset can be different for different processes on the same system, ICU uses its conversion framework for converting to and from UTF-16.

If it is known that the default charset is always UTF-8 on the target platform, then you should #define U_CHARSET_IS_UTF8 1 in or before unicode/utypes.h. (For example, modify the default value there or pass -DU_CHARSET_IS_UTF8=1 as a compiler flag.) This will change most of the implementation code to use dedicated (simpler, faster) UTF-8 code paths and avoid dependencies on the conversion framework. (Avoiding such dependencies helps with statically linked libraries and may allow the use of UCONFIG_NO_LEGACY_CONVERSION or even UCONFIG_NO_CONVERSION [see unicode/uconfig.h].)

Low-Level UTF-8 String Operations

unicode/utf8.h defines macros for UTF-8 with semantics parallel to the UTF-16 macros in unicode/utf16.h. The macros handle many cases inline, but call internal functions for complicated parts of the UTF-8 encoding form. For example, the following code snippet counts white space characters in a string:

#include "unicode/utypes.h"
#include "unicode/stringpiece.h"
#include "unicode/utf8.h"
#include "unicode/uchar.h"

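// Counts the code points in a UTF-8 string that have the White_Space property.
int32_t countWhiteSpace(icu::StringPiece sp) {
    const char *s=sp.data();
    int32_t length=sp.length();
    int32_t count=0;
    for(int32_t i=0; i<length;) {
        UChar32 c;
        // Decodes one code point and advances i; c<0 for an ill-formed sequence.
        U8_NEXT(s, i, length, c);
        if(u_isUWhiteSpace(c)) {
            ++count;
        }
    }
    return count;
}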

Dedicated UTF-8 APIs

ICU has some APIs dedicated to UTF-8. They tend to have been added for “worker functions” like comparing strings, to avoid the string conversion overhead, rather than for “builder functions” like factory methods and attribute setters.

For example, icu::Collator::compareUTF8() compares two UTF-8 strings incrementally, without converting all of the two strings to UTF-16 if there is an early base letter difference.

ucnv_convertEx() can convert between UTF-8 and another charset, if one of the two UConverters is a UTF-8 converter. The conversion from UTF-8 to most other charsets uses a dedicated, optimized code path, avoiding the pivot through UTF-16. (Conversion from other charsets to UTF-8 could be optimized as well, but that has not been implemented yet as of ICU 4.4.)

Other examples: (This list may or may not be complete.)

  • ucasemap_utf8ToLower(), ucasemap_utf8ToUpper(), ucasemap_utf8ToTitle(), ucasemap_utf8FoldCase()
  • ucnvsel_selectForUTF8()
  • icu::UnicodeSet::spanUTF8(), spanBackUTF8() and uset_spanUTF8(), uset_spanBackUTF8() (These are highly optimized for UTF-8 processing.)
  • ures_getUTF8String(), ures_getUTF8StringByIndex(), ures_getUTF8StringByKey()
  • uspoof_checkUTF8(), uspoof_areConfusableUTF8(), uspoof_getSkeletonUTF8()

Abstract Text APIs

ICU offers several interfaces for text access, designed for different use cases. (Some interfaces are simply newer and more modern than others.) Some ICU services work with some of these interfaces, and for some of these interfaces ICU offers UTF-8 implementations out of the box.

UText can be used with BreakIterator APIs (character/word/sentence/… segmentation). utext_openUTF8() creates a read-only UText for a UTF-8 string.

  • Note: In ICU 4.4 and before, BreakIterator only works with UTF-8 (or any other charset with non-1:1 index conversion to UTF-16) when no dictionary is used. This excludes dictionary-based segmentation such as Thai word breaking. See ticket #5532.
  • As a workaround for Thai word breaking, you can convert the string to UTF-16, find the boundaries there, and map each UTF-16 index to a UTF-8 string index by preflighting with u_strToUTF8(dest=NULL, destCapacity=0, *destLength gets UTF-8 index).
  • ICU 4.4 has a technology preview for UText in the regular expression API, but some of the UText regex API and semantics are likely to change for ICU 4.6. (Especially indexing semantics.)

A UCharIterator can be used with several collation APIs (although there is also the newer icu::Collator::compareUTF8()) and with u_strCompareIter(). uiter_setUTF8() creates a UCharIterator for a UTF-8 string.

It is also possible to create a CharacterIterator subclass for UTF-8 strings, but CharacterIterator has many virtual methods and requires UTF-16 string index semantics.