Conversion Data

Contents

  1. Introduction
    1. Algorithmic vs. Data-based
    2. Stateful vs. Stateless
    3. Scope of this chapter
  2. ICU Mapping Table Data Files
    1. Overview
    2. .ucm File Format
    3. State table syntax in .ucm files
    4. Extension and delta tables
      1. Converting multiple characters as a unit
      2. Delta (extension-only) conversion table files
      3. Other enhancements
    5. Examples for codepage state tables

Introduction

Algorithmic vs. Data-based

In a comprehensive conversion library, there are three kinds of codepage converter implementations: converters that use algorithms, converters that use mapping data, and converters that use both.

  1. Most codepages have a simple and straightforward structure but have an arbitrary relationship between input and output character codes. Mapping tables are necessary to define the conversion. If the codepage characters use more than one byte each, then the mapping table must also define the structure of the codepage.

  2. Algorithmic converters work by transforming the input stream with built-in algorithms and possibly small, hard coded tables. The conversion can be complex, but the actual mapping of a character code is done numerically if the converter is purely algorithmic.

  3. In some cases, a converter needs to be algorithmic for its basic operations but also relies on mapping data.

ICU provides converter implementations for all three groups of codepages. Since ICU always converts to or from Unicode, the purely algorithmic converters are the ones for Unicode encodings (such as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE, SCSU, BOCU-1 and UTF-7). Since Unicode is based on US-ASCII and ISO-8859-1 (“ISO Latin-1”), these encodings also use algorithmic converters for performance reasons.

Most other codepages use simple byte sequences but are not encodings of Unicode. They are converted with generic code using mapping data tables. ICU also supports a few encodings, like ISO-2022 and its variants, that employ an algorithmic structure to switch between a set of codepages. The converters for these encodings are algorithmic but use mapping tables for the embedded codepages.

Stateful vs. Stateless

Character encodings are either stateful or stateless:

  1. Stateless encodings define a byte sequence for each character. Complete character byte sequences can be used in any order, and the same complete byte sequence always encodes the same character. It is preferable to always encode one character using the same byte sequence.

  2. Stateful encodings define byte sequences that change the state of the text stream. Depending on the current state, the same byte sequence may encode a different character and the same character may be encoded with different byte sequences.

This distinction between stateless and stateful encodings is important, because it determines whether and how an available ICU converter implementation can be used. The following are some more important considerations related to stateless versus stateful encodings:

  1. A runtime converter object is always stateful, even for “stateless” encodings. It is always stateful because an input buffer may end with a partial byte sequence that is to be continued in the next input buffer in the following conversion call. The information about this is stored in the converter object. Similarly, if the input is Unicode text, then an input buffer may end with the first of a pair of surrogates. The converter object also stores overflow bytes or code units if the result of a character mapping did not fit entirely into the output buffer.

  2. Even stateless encodings are handled statefully in our converter implementation in order to interpret “complete byte sequences”. This is necessary because many encodings use the same byte value in different positions of byte sequences for different characters; a specific byte value may be a lead byte or a trail byte. For instance, the lead and trail byte values overlap in codepages like Shift-JIS. If a program does not start reading at a character boundary, it may instead interpret the byte sequences from two or more separate characters as one character. Often, character boundaries can be detected reliably only by reading the non-Unicode text linearly from the beginning. This can be a problem for non-Unicode text processing, where text insertion, deletion, and searching are common. The UTF-8/16/32 encodings do not have this problem because the single, lead, and trail units have disjoint values and character boundaries can be easily found.

  3. Some stateful encodings only switch between two states: one with one byte per character and one with two bytes per character. This type of encoding is very common in mainframe systems based on Extended Binary Coded Decimal Interchange Code (EBCDIC) and is actually handled in ICU with almost the same code and type of mapping tables as stateless codepages.

  4. The classifications of algorithmic vs. data-based converters and of stateless vs. stateful encodings are independent of each other: UTF-8, UTF-16, and UTF-32 encodings are algorithmic but stateless; UTF-7 and SCSU encodings are algorithmic and stateful; Windows-1252 and Shift-JIS encodings are data-based and stateless; ISO-2022-JP encoding is algorithmic, data-based, and stateful.
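The first two considerations can be observed with any converter library. The following is a minimal sketch that uses Python's standard codecs module purely as a stand-in; it is not ICU code, and the sample characters are chosen only for illustration.

```python
import codecs

# 1. A converter object keeps state even for a "stateless" encoding:
#    the three-byte UTF-8 sequence for U+3042 is split across two buffers.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xe3\x81")   # partial sequence: stored, nothing output
second = decoder.decode(b"\x82")      # the stored bytes are completed here
print(repr(first), repr(second))

# 2. Shift-JIS lead and trail byte values overlap: 0x5c is the trail byte
#    of U+8868 and is also a complete single-byte character on its own.
sjis = "\u8868".encode("shift_jis")
whole = sjis.decode("shift_jis")
misaligned = sjis[1:].decode("shift_jis")   # starts reading mid-character
print(sjis.hex(), repr(whole), repr(misaligned))
```

Reading from the second byte produces a different, apparently valid character, which is why boundaries in such codepages can often be found only by reading from the beginning.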

Scope of this chapter

The following sections in this chapter discuss the mapping data tables that are used in ICU. For related material, please see:

  1. ICU character set collection

  2. Unicode Technical Report 22

  3. “Cross Mapping Tables” in Unicode Online Data

ICU Mapping Table Data Files

Overview

As stated above, most ICU converters rely on character mapping tables. ICU 1.8 has one single data structure for all character mapping tables, which is used by a generic Multi-Byte Character Set (MBCS) converter implementation. The implementation is flexible enough to handle stateless encodings with the following parameters:

  1. Support for variable-length, byte-based encodings with 1 to 4 bytes per character.

  2. Support for all Unicode characters (code points 0..0x10ffff). Since ICU 1.8 uses the UTF-16 encoding as its Unicode encoding form, surrogate pairs are completely supported.

  3. Efficient distinction between unassigned (unmappable) and illegal byte sequences.

  4. It is not possible to convert from Unicode to byte sequences with leading zero bytes.

  5. Simple stateful encodings that use only Shift-In and Shift-Out (SI/SO) codes with one single-byte and one double-byte state are also handled.

Note: In the context of conversion tables, “unassigned” code points or codepage byte sequences are valid but do not have a mapping. This is different from “unassigned” code points in a character set like Unicode or Shift-JIS which are codes that do not have assigned characters.

Prior to version 1.8, ICU used more specific, more limited, converter implementations for Single Byte Character Set (SBCS), Double Byte Character Set (DBCS), and the stateful Extended Binary Coded Decimal Interchange Code (EBCDIC) codepages. Mapping table data is provided in text files. ICU comes with several dozen .ucm files (UniCode Mapping, in icu/source/data/mappings/) that are translated at build time by its makeconv tool (source code in icu/source/tools/makeconv). The makeconv tool writes one binary, memory-mappable .cnv file per .ucm file. The resulting .cnv files are included by default in the common data file for use at runtime.

The format of the .ucm files is similar to the format of the UPMAP files as provided by IBM® in the codepage repository and as used in the uconvdef tool on AIX. UPMAP is a text file that specifies the mapping of a codepage character to and from Unicode.

The format of the .cnv files is ICU-specific. The .cnv file format may change between ICU versions even for the same .ucm files. The .ucm file format may be extended to include more features.

The following sections concentrate on the .ucm file format. The .cnv file format is described in the source code in icu/source/common/ucnvmbcs.c and is updated along with the MBCS converter implementation.

These conversion tables can have more than one name. ICU allows multiple names (“aliases”) for the same encoding. It matches a requested encoding name against a list of names in icu/source/data/mappings/convrtrs.txt and, when it finds a match, opens a converter with the name in the leftmost position in the matching line. The name matching is not case-sensitive, and ICU ignores spaces, dashes, and underscores. At build time, the gencnval tool, located in the icu/source/tools/gencnval directory, generates a binary form of the convrtrs.txt file, the cnvalias.icu file (“Converter Aliases data file”), as a data file for use at runtime.
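The name matching described above can be pictured with a small sketch. It assumes only the rules stated here (ignore case, spaces, dashes, and underscores, then return the leftmost name of the matching line). The function names and the alias lines are hypothetical, and the real convrtrs.txt syntax is not parsed.

```python
from typing import List, Optional

def normalize_alias(name: str) -> str:
    """Lowercase the name and drop spaces, dashes, and underscores."""
    return "".join(ch for ch in name.lower() if ch not in " -_")

# Hypothetical alias lines; the leftmost name is the converter that is opened.
ALIAS_LINES: List[List[str]] = [
    ["UTF-8", "utf8"],
    ["ibm-943_P15A-2003", "Shift_JIS", "MS_Kanji"],
]

def lookup_converter(requested: str) -> Optional[str]:
    """Return the leftmost name of the first line containing a matching alias."""
    wanted = normalize_alias(requested)
    for names in ALIAS_LINES:
        if any(normalize_alias(name) == wanted for name in names):
            return names[0]
    return None

print(lookup_converter("shift jis"))
print(lookup_converter("Utf_8"))
```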

.ucm File Format

.ucm files are line-oriented text files. Empty lines and comments starting with ‘#’ are ignored.

A .ucm file contains two sections:

  1. a header with general specifications of the codepage

  2. a mapping table section between the “CHARMAP” and “END CHARMAP” lines.

For example:

<code_set_name>               "IBM-943"
<char_name_mask>              "AXXXX"
<mb_cur_min>                  1
<mb_cur_max>                  2
<uconv_class>                 "MBCS"
<subchar>                     \xFC\xFC
<subchar1>                    \x7F
<icu:state>                   0-7f, 81-9f:1, a0-df, e0-fc:1
<icu:state>                   40-7e, 80-fc
#
CHARMAP
#
#
#ISO 10646      IBM-943
#_________      _________
<U0000> \x00 |0
<U0001> \x01 |0
<U0002> \x02 |0
<U0003> \x03 |0
...
<UFFE4> \xFA\x55 |1
<UFFE5> \x81\x8F |0
<UFFFD> \xFC\xFC |2
END CHARMAP

The header fields are:

  1. code_set_name - The name of the codepage. The makeconv tool generates the .cnv file name from the .ucm filename but uses this header field for the converter name that it writes into the .cnv file for ucnv_getName. The makeconv tool prints a warning message if this header field does not match the file name. The file name is not case-sensitive.

  2. char_name_mask - This is ignored by the makeconv tool. “AXXXX” specifies that the POSIX-style character “name” consists of one letter (Alpha) followed by 4 hexadecimal digits. Since ICU only uses Unicode character “names” (that is, code points), the format is fixed (see below).

  3. mb_cur_min - The minimum number of bytes per character.

  4. mb_cur_max - The maximum number of bytes per character.

  5. uconv_class - This can be “SBCS”, “DBCS”, “MBCS”, or “EBCDIC_STATEFUL”. The most general converter class/type/category is MBCS, which requires that the codepage structure be specified with <icu:state> lines. The other types of converters are subsets of MBCS. The makeconv tool uses predefined state tables for these other converters when their structure is not explicitly specified. The following describes how the converter types are interpreted:

    a. MBCS: Generic ICU converter type, requires a state table

    b. SBCS: Single-byte, 8-bit codepages

    c. DBCS: Double-byte EBCDIC codepages

    d. EBCDIC_STATEFUL: Mixed Single-Byte or Double-Byte EBCDIC codepages (stateful, using SI/SO)

The exact implied state tables for non-MBCS types are shown in the “Examples for codepage state tables” section. A state table may need to be overridden in order to allow supplementary characters (U+10000 and up).

  6. subchar - The substitution character byte sequence for this codepage. This sequence must be a valid byte sequence according to the codepage structure.

  7. subchar1 - This is the single byte substitution character when subchar is defined. Some IBM converter libraries use different substitution characters for “narrow” and “wide” characters (single-byte and double-byte). ICU uses only one substitution character per codepage because it is common industry practice.

  8. icu:state - See the “State table syntax in .ucm files” section for a detailed description of how to specify a codepage structure.

  9. icu:charsetFamily - This specifies if the codepage is ASCII or EBCDIC based.

The subchar and subchar1 fields have been known to cause some confusion. The following conditions outline when each is used:

  1. Conversion from Unicode to a codepage occurs and an unassigned code point is found

    a. If a subchar1 byte is defined and a subchar1 mapping is defined for the code point (with a |2 precision indicator), output the subchar1

    b. Otherwise output the regular subchar

  2. Conversion from a codepage to Unicode occurs and an unassigned byte sequence is found

    a. If the input sequence is of length 1 and a subchar1 byte is specified for the codepage, output U+001A

    b. Otherwise output U+FFFD
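The two decision rules above can be written down as a pair of small functions. This is only a sketch of the rules as listed, not ICU's implementation; the function and parameter names are hypothetical, and the sample values come from the IBM-943 header shown earlier.

```python
from typing import Optional

def from_unicode_substitution(subchar: bytes,
                              subchar1: Optional[bytes],
                              has_subchar1_mapping: bool) -> bytes:
    """Bytes written for an unassigned code point (Unicode to codepage)."""
    if subchar1 is not None and has_subchar1_mapping:
        return subchar1          # the code point has a |2 mapping
    return subchar               # regular substitution sequence

def to_unicode_substitution(input_length: int,
                            subchar1: Optional[bytes]) -> str:
    """Character written for an unassigned byte sequence (codepage to Unicode)."""
    if input_length == 1 and subchar1 is not None:
        return "\u001a"
    return "\ufffd"

# <subchar> \xFC\xFC and <subchar1> \x7F from the IBM-943 example header.
print(from_unicode_substitution(b"\xfc\xfc", b"\x7f", True))
print(from_unicode_substitution(b"\xfc\xfc", b"\x7f", False))
print(repr(to_unicode_substitution(1, b"\x7f")))
print(repr(to_unicode_substitution(2, b"\x7f")))
```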

In the CHARMAP section of a .ucm file, each line contains a Unicode code point (like <U(1-6 hexadecimal digits for the code point)> ), a codepage character byte sequence (each byte like \xhh (2 hexadecimal digits) ), and an optional “precision” or “fallback” indicator.

The precision indicator must be present either in all mappings or in none of them. The indicator is a pipe symbol | followed by a 0, 1, 2, 3, or 4 that has the following meaning:

  • |0 - A “normal”, roundtrip mapping from a Unicode code point and back.
  • |1 - A “fallback” mapping only from Unicode to the codepage, but not back.
  • |2 - A subchar1 mapping. The code point is unmappable, and if a substitution is performed, then the subchar1 should be used rather than the subchar. Otherwise, such mappings are ignored.
  • |3 - A “reverse fallback” mapping only from the codepage to Unicode, but not back to the codepage.
  • |4 - A “good one-way” mapping only from Unicode to the codepage, but not back.
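To make the line format concrete, the following sketch parses a single CHARMAP line with one code point into its three parts. It is not the makeconv parser. The names are hypothetical, and treating everything after a ‘#’ as a comment is an assumption of this sketch.

```python
import re
from typing import NamedTuple, Optional

class Mapping(NamedTuple):
    code_point: int
    byte_seq: bytes
    precision: Optional[int]   # None when the line has no indicator

LINE = re.compile(
    r"^<U([0-9A-Fa-f]{1,6})>\s+"      # Unicode code point
    r"((?:\\x[0-9A-Fa-f]{2})+)"       # one or more \xhh bytes
    r"(?:\s*\|([0-4]))?$"             # optional precision indicator
)

def parse_mapping(line: str) -> Optional[Mapping]:
    """Parse one mapping line; return None for empty and comment lines."""
    text = line.split("#", 1)[0].strip()   # assumption: '#' starts a comment
    if not text:
        return None
    match = LINE.match(text)
    if match is None:
        raise ValueError(f"not a mapping line: {line!r}")
    code_point = int(match.group(1), 16)
    hex_bytes = re.findall(r"\\x([0-9A-Fa-f]{2})", match.group(2))
    byte_seq = bytes(int(h, 16) for h in hex_bytes)
    precision = int(match.group(3)) if match.group(3) else None
    return Mapping(code_point, byte_seq, precision)

print(parse_mapping(r"<UFFE4> \xFA\x55 |1"))
print(parse_mapping("#ISO 10646      IBM-943"))
```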

Fallback mappings from Unicode typically do not map codes for the same character, but for “similar” ones. This mapping is sometimes done if a character exists in Unicode but not in the codepage. To replace it, ICU maps the Unicode code point to the codepage code of a similar-looking character for human-readable output. This mapping feature is not useful for text data transmission, especially in markup languages where a Unicode code point can be escaped with its code point value. The ICU application programming interface (API) ucnv_setFallback() controls this fallback behavior.

“Reverse fallbacks” are technically similar, but the same Unicode character can be encoded twice in the codepage. ICU always uses reverse fallbacks at runtime.

A subset of the fallback mappings from Unicode is always used at runtime: Those that map private-use Unicode code points. Fallbacks from private-use code points are often introduced as replacements for previous roundtrip mappings for the same pair of codes. These replacements are used when a Unicode version assigns a new character that was previously mapped to that private-use code point. The mapping table is then changed to map the same codepage byte sequence to the new Unicode code point (as a new roundtrip) and the mapping from the old private-use code point to the same codepage code is preserved as a fallback.

A “good one-way” mapping is like a fallback, but ICU always uses “good one-way” mappings at runtime, regardless of the fallback API flag.

The idea is that fallbacks normally lose information, such as mapping from a compatibility variant of a letter to the ASCII version; however, fallbacks from PUA and reverse fallbacks are assumed to be for “the same character”, just an older code for it.

Something similar happens with from-Unicode Variation Selector sequences. It is possible to round-trip (|0) either the unadorned character or the sequence with a variation selector, and add a “good one-way” mapping (|4) from the other version. That “good one-way” mapping does not lose much information, and it is used even if the “use fallback” API flag is false. Alternatively, both mappings could be fallbacks (|1) that should be controlled by the “use fallback” attribute.
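The rules in this section about which mappings take part in a conversion can be summarized in a short sketch. It restates only the text above and is not ICU code. The function names are hypothetical, and the private-use ranges are the standard Unicode private-use areas.

```python
def is_private_use(code_point: int) -> bool:
    """True for the BMP private-use area and the two private-use planes."""
    return (0xE000 <= code_point <= 0xF8FF
            or 0xF0000 <= code_point <= 0xFFFFD
            or 0x100000 <= code_point <= 0x10FFFD)

def used_from_unicode(precision: int, code_point: int, use_fallback: bool) -> bool:
    """Does a mapping with this precision apply when converting from Unicode?"""
    if precision in (0, 4):          # roundtrip and "good one-way": always
        return True
    if precision == 1:               # fallback: by flag, or always for private use
        return use_fallback or is_private_use(code_point)
    return False                     # |2 marks subchar1, |3 is to-Unicode only

def used_to_unicode(precision: int) -> bool:
    """Does a mapping with this precision apply when converting to Unicode?"""
    return precision in (0, 3)       # roundtrip and reverse fallback: always

print(used_from_unicode(1, 0xFF5E, use_fallback=False))
print(used_from_unicode(1, 0xE000, use_fallback=False))
print(used_to_unicode(3))
```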

State table syntax in .ucm files

The conversion to Unicode uses a state machine to achieve the above capabilities with reasonable data file sizes. The state machine information itself is loaded with the conversion data and defines the structure of the codepage, including which byte sequences are valid, unassigned, and illegal. This data cannot (or not easily) be computed from the pure mapping data. Instead, the .ucm files for MBCS encodings have additional entries that are specific to the ICU makeconv tool. The state tables for SBCS, DBCS, and EBCDIC_STATEFUL are implied, but they can be overridden (see the examples below). The state tables are specified in the header section of the .ucm file with <icu:state> lines. Each line defines one state (row) of the state machine. The state machine uses a table with as many rows as there are states (that is, as many as there are <icu:state> lines). Each row has 256 entries, one for each possible byte value.

The state table lines in the .ucm header conform to the following Extended Backus-Naur Form (EBNF)-like grammar (whitespace is allowed between all tokens):

row=[[firstentry ','] entry (',' entry)*]
firstentry="initial" | "surrogates"
           (initial state (default for state 0), output is all surrogate pairs)

Each state table row description (that follows the <icu:state>) begins with an optional initial or surrogates keyword and is followed by one or more column entries. For the purpose of codepage state tables, the states=rows in the table are numbered beginning at 0 for the first line in the .ucm file header. The numbers are assigned implicitly by the makeconv tool in order of the <icu:state> lines.

A row may be empty (nothing following the <icu:state>) - that is equivalent to “all illegal” or 0-ff.i and is useful for trail byte states for all-illegal byte sequences.

entry=range [':' nextstate] ['.' [action]]
range     = number ['-' number]
nextstate = number (0..7f)
action    = 'u' | 's' | 'p' | 'i'
                (unassigned, state change only, surrogate pair, illegal)
number    = (1- or 2-digit hexadecimal number)

Each column entry contains at least one hexadecimal byte value or value range and is separated by a comma. The column entry specifies how to interpret an input byte in the row’s state. If neither a next state nor an action is explicitly specified (only the byte range is given) then the byte value terminates the byte sequence, results in a valid mapping to a Unicode BMP character, and resets the state number to 0. The first line with <icu:state> is called state 0.

The next state can be explicitly specified with a separating colon ( : ) followed by the number of the state (=number/index of the row, starting at 0). This specification is mostly used for intermediate byte values (that is, bytes that are not the last ones in a sequence). The state machine needs to proceed to the next state and read another byte. In this case, no other action is specified.

If the byte value(s) terminate(s) a byte sequence, then the byte sequence results in the following depending on the action that is announced with a period ( . ) followed by a letter:

  • u - Unassigned. The byte sequence is valid but does not encode a character.
  • (no letter) - Valid. If no action letter is specified, then the byte sequence is valid and encodes a Unicode character up to U+ffff.
  • p - Surrogate Pair. The byte sequence is valid and the result may map to a UTF-16 encoded surrogate pair.
  • i - Illegal. The byte sequence is illegal. This is the default for all byte values in a row that are not otherwise specified with column entries.
  • s - State change only. The byte sequence does not encode any character but may change the state number. This may be used with simple, stateful encodings (for example, SI/SO codes), but currently it is not used by ICU.

If an action is specified without a next state, then the next state number defaults to 0. In other words, a byte value (range) terminates a sequence if there is an action specified for it, or when there is neither an action nor a next state. In the latter case, the byte value defaults to “valid, next state is 0” (equivalent to :0.).

If a byte value is not specified in any column entry of a row, then it is illegal in that state. If a byte value is specified in more than one column entry of the same row, then ICU uses the last specification. These specifications allow you to assign common properties for a wide byte value range followed by a few exceptions. This is easier than having to specify mutually exclusive ranges, especially if many of them have the same properties.
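The grammar and the defaulting rules above can be exercised with a small parser that expands one row description into 256 entries. This is a sketch under stated assumptions, not the makeconv parser: the names are hypothetical, "v" is this sketch's own label for a valid final byte with no action letter, and unspecified bytes are assumed to be illegal with next state 0.

```python
import re
from typing import List, Optional, Tuple

# (next_state, action): action is None for an intermediate byte, otherwise
# "v" (valid), "u", "s", "p", or "i".
Entry = Tuple[int, Optional[str]]

ENTRY = re.compile(
    r"^([0-9a-f]{1,2})(?:-([0-9a-f]{1,2}))?"   # byte value or range
    r"(?::([0-9a-f]{1,2}))?"                   # optional next state
    r"(\.[uspi]?)?$"                           # optional '.' and action letter
)

def parse_state_row(text: str) -> Tuple[Optional[str], List[Entry]]:
    """Expand the text after <icu:state> into (keyword, 256 entries)."""
    row: List[Entry] = [(0, "i")] * 256        # default: illegal
    fields = [f.strip().lower() for f in text.split(",") if f.strip()]
    keyword = None
    if fields and fields[0] in ("initial", "surrogates"):
        keyword = fields.pop(0)
    for field in fields:
        match = ENTRY.match("".join(field.split()))   # drop inner whitespace
        if match is None:
            raise ValueError(f"bad column entry: {field!r}")
        start = int(match.group(1), 16)
        end = int(match.group(2), 16) if match.group(2) else start
        next_state = int(match.group(3), 16) if match.group(3) else 0
        if match.group(4):                     # '.', '.u', '.s', '.p', '.i'
            action: Optional[str] = match.group(4)[1:] or "v"
        elif match.group(3):                   # next state only: intermediate
            action = None
        else:                                  # range only: valid, state 0
            action = "v"
        for byte in range(start, end + 1):     # later entries override earlier
            row[byte] = (next_state, action)
    return keyword, row

keyword, row = parse_state_row("a1-fe:1, a1:4, a3-af:4")
print(keyword, row[0xA1], row[0xA2], row[0xA0])
```

Because entries are applied in order, a wide range followed by exceptions behaves as described: the last specification for a byte value wins.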

The optional keyword at the beginning of a state line has the following effect:

  • initial - The state machine can start reading byte sequences in this state. State 0 is always an initial state. Only initial states can be next states for final byte values. In an initial state, the Unicode mappings for all final bytes are also stored directly in the state table.
  • surrogates - All Unicode mappings for final bytes in non-initial states are stored in a separate table of 16-bit Unicode (UTF-16) code units. Since most legacy codepages map only to Unicode code points up to U+ffff (the Basic Multilingual Plane, BMP), the default allocation per mapping result is one 16-bit unit. Individual byte values can be specified to map to surrogate pairs (= two 16-bit units) with action letter p. The surrogates keyword specifies the values for the entire state (row). Surrogate pair mapping entries can still hold single units depending on the actual mapping data, but single-unit mapping entries cannot hold a pair of units. Mapping to single-unit entries is the default because the mapping is faster, uses half as much memory in the code units table, and is sufficient for most legacy codepages.

When converting to Unicode, the state machine starts in state number 0. In each iteration, the state machine reads one input (codepage) byte and either proceeds to the next state as specified, or treats it as a final byte with the specified action and an optional non-0 next (initial) state. This means that a state table needs to have at least as many state rows as the maximum number of bytes per character, which is the maximum length of any byte sequence.

Exception: For EBCDIC_STATEFUL codepages, double-byte sequences start in state 1, with the SI/SO bytes switching from state 0 to state 1 or from state 1 to state 0. See the default state table below.
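The iteration described above can be sketched as a loop over the input bytes. The table below is built by hand from the two <icu:state> lines of the IBM-943 header example; the helper names and the "v" label for a valid final byte are assumptions of this sketch, and no Unicode mapping lookup is performed.

```python
from typing import List, Optional, Tuple

Entry = Tuple[int, Optional[str]]   # (next_state, action); None = intermediate

def make_row(*spans: Tuple[int, int, int, Optional[str]]) -> List[Entry]:
    """Build a 256-entry row; bytes that are not listed are illegal."""
    row: List[Entry] = [(0, "i")] * 256
    for start, end, next_state, action in spans:
        for byte in range(start, end + 1):
            row[byte] = (next_state, action)
    return row

# 0-7f, 81-9f:1, a0-df, e0-fc:1
# 40-7e, 80-fc
SHIFT_JIS = [
    make_row((0x00, 0x7F, 0, "v"), (0x81, 0x9F, 1, None),
             (0xA0, 0xDF, 0, "v"), (0xE0, 0xFC, 1, None)),
    make_row((0x40, 0x7E, 0, "v"), (0x80, 0xFC, 0, "v")),
]

def split_sequences(table: List[List[Entry]], data: bytes):
    """Return (byte_sequence, action) for each sequence found in data."""
    result = []
    state, start = 0, 0                      # conversion starts in state 0
    for index, byte in enumerate(data):
        next_state, action = table[state][byte]
        state = next_state
        if action is not None:               # final byte: sequence is complete
            result.append((data[start:index + 1], action))
            start = index + 1
    if start < len(data):                    # input ended inside a sequence
        result.append((data[start:], "truncated"))
    return result

for sequence, action in split_sequences(SHIFT_JIS, b"A\x85\x61\x80\x85\x31"):
    print(sequence.hex(), action)
```

A real converter would keep the state and the pending bytes in the converter object between calls instead of reporting a truncated sequence.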

Extension and delta tables

ICU 2.8 adds an additional “extension” data structure to its conversion tables. The new data structure supports a number of new features. When any of the following features are used, then all mappings must use a precision indicator.

Converting multiple characters as a unit

Before ICU 2.8, only one Unicode code point could be converted to or from one complete codepage byte sequence. The new data structure supports the conversion between multiple Unicode code points and multiple complete codepage byte sequences. (A “complete codepage byte sequence” is a sequence of bytes which is valid according to the state table.)

Syntax: Simply write more than one Unicode code point on a mapping line, and/or more than one complete codepage byte sequence. Plus signs (+) are optional between code points and between bytes. For example, ibm-1390_P110-2003.ucm contains

<U304B><U309A> \xEC\xB5 |0

and test3.ucm contains

<U101234>+<U50005>+<U60006> \x07+\x00+\x01\x02\x0f+\x09 |0

For more examples see the ICU conversion data and the icu/source/test/testdata/test*.ucm test data files.

ICU 2.8 supports up to 19 UChars on the Unicode side of a mapping and up to 31 bytes on the codepage side.

The longest match possible is converted in order to properly handle tables where the source sides of some mappings are prefixes of the source sides of other mappings.
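The effect of longest-match conversion can be shown with a plain dictionary lookup. This sketch is not the ICU extension table algorithm. The two-code-point mapping is the one quoted from ibm-1390_P110-2003.ucm above, while the single-code-point mapping uses hypothetical placeholder bytes so that one source side is a prefix of another.

```python
from typing import Dict

def convert_longest_match(table: Dict[str, bytes], text: str) -> bytes:
    """Convert text by always taking the longest source side that matches."""
    max_len = max(len(source) for source in table)
    out = bytearray()
    position = 0
    while position < len(text):
        remaining = len(text) - position
        for length in range(min(max_len, remaining), 0, -1):
            source = text[position:position + length]
            if source in table:
                out += table[source]
                position += length
                break
        else:
            raise LookupError(f"no mapping for U+{ord(text[position]):04X}")
    return bytes(out)

TABLE = {
    "\u304b\u309a": b"\xec\xb5",   # <U304B><U309A> \xEC\xB5 |0
    "\u304b": b"\x01\x02",         # hypothetical bytes, for illustration only
}

print(convert_longest_match(TABLE, "\u304b\u309a\u304b").hex())
```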

As a side effect, if conversion offsets are written and a potential match crosses buffer boundaries, then some of the initial offsets for the following output may be unknown (-1) because their input was stored in the converter from a previous buffer while looking for a longer match.

Conversion tables for SI/SO-stateful (usually EBCDIC_STATEFUL) codepages cannot include mappings with SI or SO bytes or where there are SBCS characters in a multi-character byte sequence. In other words, for these tables there must be exactly one byte in a mapping or else a sequence of one or more DBCS characters.

Delta (extension-only) conversion table files

Physically, a binary conversion table (.cnv) file automatically contains both a traditional “base table” data structure for the 1:1 mappings and a new “extension table” for the m:n mappings if any are encountered in the .ucm file. An extension table can also be requested manually by splitting the CHARMAP into two. The first CHARMAP section will be used for the base table, and the second only for the extension table. M:n mappings in the first CHARMAP will be moved to the extension table.

In order to save space for very similar conversion tables, it is possible to create delta .cnv files that contain only an extension table and the name of another .cnv file with a base table. The base file must be split into two CHARMAPs such that the base file’s base table does not contain any mappings that contradict any of the delta file’s mappings.

The delta (extension-only) file uses only a single CHARMAP section. In addition, it needs a line in the header that both causes building just a delta file and specifies the name of the base file. For example, windows-936-2000.ucm contains

<icu:base> "ibm-1386_P100-2002"

makeconv ignores all mappings for the delta file that are also in the base file’s base table. If the two conversion tables are sufficiently similar, then the delta file will contain only a relatively small set of mappings, which results in a small .cnv file. At runtime, both the delta file and its base file are loaded, and the base file’s base table is used together with the extension file. The base file works as a standalone file, using its own extension table for its full set of mappings. The base file must be in the same ICU data package as the delta file.

The hard part is to split the base file’s mappings into base and extension CHARMAPs such that the base table does not overlap with any delta file, while all shared mappings should be in the base table. (The base table data structure is more compact than the extension table data structure.)

ICU provides the ucmkbase tool in the ucmtools collection to do this.
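The idea behind the split can be pictured with sets of mappings. The following is a simplified sketch and not how ucmkbase is implemented: mappings are reduced to (code point, bytes) pairs, precision indicators and m:n mappings are ignored, and the sample data is hypothetical.

```python
from typing import List, Set, Tuple

Mapping = Tuple[int, bytes]

def split_tables(base_file: Set[Mapping], variants: List[Set[Mapping]]):
    """Return (base table, base extension, one delta per variant)."""
    shared = set(base_file)
    for variant in variants:
        shared &= variant                  # keep only mappings common to all
    base_extension = base_file - shared    # base-only mappings
    deltas = [variant - shared for variant in variants]
    return shared, base_extension, deltas

# Hypothetical mappings: the files agree on U+0041 and differ on U+005C.
base_file = {(0x41, b"\x41"), (0x5C, b"\x5c")}
variant = {(0x41, b"\x41"), (0x5C, b"\x81\x5f")}

base_table, base_extension, deltas = split_tables(base_file, [variant])
print(sorted(base_table), sorted(base_extension), sorted(deltas[0]))
```

In this picture, each file's full mapping set is the shared base table plus its own extension or delta, so the base table never contradicts a delta file.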

For example, the following illustrates how to use ucmkbase to make a base .ucm file for three Shift-JIS conversion table variants. (ibm-943_P15A-2003.ucm becomes the base.)

C:\tmp\icu\ucm>ren ibm-943_P15A-2003.ucm ibm-943_P15A-2003.orig
C:\tmp\icu\ucm>ucmkbase ibm-943_P15A-2003.orig ibm-943_P130-1999.ucm ibm-942_P12A-1999.ucm > ibm-943_P15A-2003.ucm

After this, the two delta .ucm files only need to get the following line added before the start of their CHARMAPs:

<icu:base> "ibm-943_P15A-2003"

The ICU tools and runtime code handle DBCS-only conversion tables specially, allowing them to be built into delta files with MBCS or EBCDIC_STATEFUL base files without using their single-byte mappings, and without ucmkbase moving the single-byte mappings of the base file into the base file’s extension table. See for example ibm-16684_P110-2003.ucm and ibm-1390_P110-2003.ucm.

Other enhancements

ICU 2.8 adds support for the specification of which unassigned Unicode code points should be mapped to subchar1 rather than the default subchar. See the discussion of subchar1 above for more details.

The extension table data structure also removes one minor limitation on ICU conversion tables: Fallback mappings to a single byte 00 are now allowed and handled properly. ICU versions before 2.8 could only handle roundtrips to/from 00.

Examples for codepage state tables

The following shows the exact implied state tables for non-MBCS types. A state table may need to be overridden in order to allow supplementary characters (U+10000 and up).

US-ASCII

0-7f

This single-row state table describes US-ASCII. Byte values from 0 to 0x7f are valid and map to Unicode characters up to U+ffff. Byte values from 0x80 to 0xff are illegal.

Shift-JIS

0-7f, 81-9f:1, a0-df, e0-fc:1
40-7e, 80-fc

This two-row state table describes the Shift-JIS structure which encodes some characters with one byte each and others with two bytes each. Bytes 0 to 0x7f and 0xa0 to 0xdf are valid single-byte encodings. Bytes 0x81 to 0x9f and 0xe0 to 0xfc are lead bytes. (That is, they are followed by one of the bytes that is specified as valid in state 1.) A byte sequence of 0x85 0x61 is valid while a single byte of 0x80 or 0xff is illegal. Similarly, a byte sequence of 0x85 0x31 is illegal.

EUC-JP

0-8d, 8e:2, 8f:3, 90-9f, a1-fe:1
a1-fe
a1-e4
a1-fe:1, a1:4, a3-af:4, b6:4, d6:4, da-db:4, ed-f2:4
a1-fe.u

This fairly complicated state table describes EUC-JP. Valid byte sequences are one, two, or three bytes long. Two-byte sequences have a lead byte of 0x8e and end in state 2, or have lead bytes 0xa1 to 0xfe and end in state 1. Three-byte sequences have a lead byte of 0x8f and continue in state 3. Some final byte value ranges are entirely unassigned, therefore they end in state 4 with an action letter of u for “unassigned” to save significant memory for the code units table. Assigned three-byte sequences end in state 1 like most two-byte sequences.

SBCS default state table:

0-ff

SBCS by default implies the structure for single-byte, 8-bit codepages.

DBCS default state table:

0-3f:3, 40:2, 41-fe:1, ff:3
41-fe
40

Important: These are four states — the fourth has an empty line (equivalent to 0-ff.i)! DBCS codepages, by default, are defined with the EBCDIC double-byte structure. Valid sequences are pairs of bytes from 0x41 to 0xfe and the one pair 0x40/0x40 for the double-byte space. The structure is defined such that all illegal byte sequences are always two in length. Therefore, every byte in the initial state is a lead byte.

EBCDIC_STATEFUL default state table:

0-ff, e:1.s, f:0.s
initial, 0-3f:4, e:1.s, f:0.s, 40:3, 41-fe:2, ff:4
0-40:1.i, 41-fe:1., ff:1.i
0-ff:1.i, 40:1.
0-ff:1.i

This is the structure of Mixed Single-byte and Double-byte EBCDIC codepages, which are stateful and use the Shift-In/Shift-Out (SI/SO) bytes 0x0f/0x0e. The initial state 0 is almost the same as for SBCS except for SI and SO. State 1 is also an initial state and is the basis for a state-shifted version of the DBCS structure above. All double-byte sequences return to state 1 and SI switches back to state 0. SI and SO are also allowed in their own states with no effect.

Note: If a DBCS or EBCDIC_STATEFUL codepage maps supplementary (non-BMP) Unicode characters, then a modified state table needs to be specified in the .ucm file. The state table needs to use the surrogates designation for a table row or .p for some entries.

The reuse of a final or intermediate state (shown for EUC-JP) is valid as long as there is no cycle in the state chain. The mappings will be unique because of the different paths to the shared state. Sharing a state saves some memory, since each state table row occupies 1kB in the .cnv file. The EUC-JP table also shows the redefinition of byte value ranges within one state row (state number 3) as shorthand. State 3 defines bytes a1-fe to go to state 1, but the following entries redefine and override certain bytes to go to state 4.

An initial state never needs a surrogates designation or .p because Unicode mapping results in initial states are stored directly in the state table, which provides enough room in each cell. The size of a generated .cnv mapping table file depends primarily on the number and distribution of the mappings and on the number of valid, multi-byte sequences that the state table allows. Each state table row takes up one kilobyte.

For single-byte codepages, the state table cells contain all to-Unicode mappings. Code point results for multi-byte sequences are stored in an array with enough room for all valid byte sequences. For all byte sequences that end in a surrogates or .p state, two code units are allocated.

If possible, valid state table entries may be changed to .u to reduce the number of valid, assignable sequences and to make the .cnv file smaller. If additional states are necessary, then each additional state itself adds 1kB to the file size, diminishing the file size savings. See the EUC-JP example above.
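The size considerations above can be turned into a rough calculation for the two-row Shift-JIS state table. This is only a back-of-the-envelope sketch based on the figures in this section (1kB per state row, one 16-bit code unit per valid multi-byte sequence). It ignores from-Unicode data, file headers, and any compaction, so it does not predict an actual .cnv file size.

```python
from typing import List, Tuple

def count_bytes(ranges: List[Tuple[int, int]]) -> int:
    """Number of byte values covered by inclusive ranges."""
    return sum(end - start + 1 for start, end in ranges)

LEAD_BYTES = [(0x81, 0x9F), (0xE0, 0xFC)]    # state 0 entries that go to state 1
TRAIL_BYTES = [(0x40, 0x7E), (0x80, 0xFC)]   # valid final bytes in state 1

ROW_BYTES = 1024        # each state table row takes up one kilobyte
CODE_UNIT_BYTES = 2     # default allocation: one 16-bit unit per sequence
STATE_ROWS = 2

valid_two_byte = count_bytes(LEAD_BYTES) * count_bytes(TRAIL_BYTES)
estimate = STATE_ROWS * ROW_BYTES + valid_two_byte * CODE_UNIT_BYTES

print(valid_two_byte, estimate)
```

Marking an unused lead byte's trail range as .u removes one trail range's worth of code units from the estimate, at the cost of one more 1kB row if a new state is needed.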

For codepages with up to two bytes per character, the makeconv tool automatically compacts the state table, if possible, by introducing one more trail byte state. This state replaces valid entries in the original trail state with unassigned entries, and each lead byte entry is changed to use the new state if there are no mappings with that lead byte.

For codepages with up to three or four bytes per character, compaction must be done manually. However, if the verbose option is set on the command line, the makeconv tool will print useful information about unassigned byte sequences.