Compression

Contents

  1. Overview of SCSU
  2. BOCU-1
  3. Usage

Overview of SCSU

Compressing Unicode text for transmission or storage results in minimal bandwidth usage and fewer storage devices. The compression scheme compresses Unicode text into a sequence of bytes by using characteristics of Unicode text. The compressed sequence can be used on its own or as further input to a general purpose file or disk-block based compression scheme. Note that the combination of the Unicode compression algorithm plus disk-block based compression produces better results than either method alone.

Strings in languages using small alphabets contain runs of characters that are coded close together in Unicode. These runs are typically interrupted only by punctuation characters, which are themselves coded in proximity to each other in Unicode (usually in the Basic Latin range).

For additional detail about the compression algorithm, which has been approved by the Unicode Consortium, please refer to Unicode Technical Report #6 (A Standard Compression Scheme for Unicode).

The Standard Compression Scheme for Unicode (SCSU) is used to:

  • express all code points in Unicode

  • approximate the storage size of traditional character sets

  • facilitate the use of short strings

  • provide transparency for characters between U+0020-U+00FF, as well as CR, LF and TAB

  • support very simple decoders

  • support simple as well as sophisticated encoders

It does not attempt to avoid the use of control bytes (including NUL) in the compressed stream.

The compression scheme is mainly intended for use with short to medium length Unicode strings. The resulting compressed format is intended for storage or transmission in bandwidth limited environments. It can be used stand-alone or as input to traditional general purpose data compression schemes. It is not intended as processing format or as general purpose interchange format.

BOCU-1

A MIME compatible encoding called BOCU-1 is also available in ICU. Details about this encoding can be found in the Unicode Technical Note #6. Both SCSU and BOCU-1 are IANA registered names.

Usage

The compression service in ICU is a part of Conversion framework, and follows the semantics of converters. For more information on how to use ICU’s conversion service, please refer to the Usage Model section in the Using Converters chapter.

uint16_t germanUTF16[]={
    0x00d6, 0x006c, 0x0020, 0x0066, 0x006c, 0x0069, 0x0065, 0x00df, 0x0074
};

uint8_t germanSCSU[]={
    0xd6, 0x6c, 0x20, 0x66, 0x6c, 0x69, 0x65, 0xdf, 0x74
};
char target[100];
UChar uTarget[100];
UErrorCode status = U_ZERO_ERROR;
UConverter *conv;
int32_t len;

/* set up the SCSU converter */
conv = ucnv_open("SCSU", &status);
assert(U_SUCCESS(status));

/* compress the string using SCSU */
len = ucnv_fromUChars(conv, target, 100, germanUTF16, -1, &status);
assert(U_SUCCESS(status));

len = ucnv_toUChars(conv, uTarget, 100, germanSCSU, -1, &status);

/* close the converter */
ucnv_close(conv);