Preparsed UCD

What

A text file with preparsed UCD (Unicode Character Database) data.

Preparser script: tools/unicode/py/preparseucd.py
ppucd.txt output: icu4c/source/data/unidata/ppucd.txt (raw text version)
Parser for ppucd.txt: icu4c/source/tools/toolutil/ppucd.h & .cpp
genprops tool rewritten to use that: tools/unicode/c/genprops

Syntax

# Preparsed UCD generated by ICU preparseucd.py

Only whole-line comments starting with #, no inline comments.

ucd;10.0.0

Data lines start with a type keyword. Data fields are semicolon-separated. The number of fields per line is highly variable.

The ucd line should be the first data line. It provides the Unicode version number.
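
As a minimal sketch of the rules so far (whole-line comments only, ucd line first), a reader could pick out the Unicode version as below. The function name read_ucd_version is hypothetical and not part of the ICU tools, and the sketch treats "should be the first data line" as a hard requirement, which is stricter than the text says.

```python
def read_ucd_version(lines):
    """Return the Unicode version string from the first data line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or whole-line comment
        fields = line.split(";")
        if fields[0] != "ucd" or len(fields) < 2:
            # Assumption: reject files whose first data line is not "ucd".
            raise ValueError("expected a ucd line first, got: " + line)
        return fields[1]
    raise ValueError("no data lines found")


sample = [
    "# Preparsed UCD generated by ICU preparseucd.py",
    "",
    "ucd;10.0.0",
    "property;Binary;Alpha;Alphabetic",
]
version = read_ucd_version(sample)
print(version)
```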

property;Binary;Alpha;Alphabetic
property;Enumerated;bc;Bidi_Class

Property lines define properties with a type and two or more aliases.

binary;N;No;F;False
binary;Y;Yes;T;True
value;bc;ON;Other_Neutral

Property value lines define the values of enumerated and catalog properties, with the property short name and two or more aliases for each value.

There is only one shared definition of the values and aliases for binary properties.
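
The property, binary, and value lines could be collected into alias lookup tables roughly as follows. This is an illustrative sketch, not the toolutil parser: the function name and table layout are hypothetical, and it assumes that the first alias on each line is the short name, as in the examples above.

```python
def build_alias_tables(lines):
    """Collect alias tables from property, binary, and value lines."""
    prop_types = {}      # property short name -> type keyword
    prop_aliases = {}    # any property alias -> property short name
    binary_aliases = {}  # any binary value alias -> short value name
    value_aliases = {}   # (property short name, value alias) -> short value name
    for line in lines:
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        kind = fields[0]
        if kind == "property":
            # property;<type>;<short name>;<more aliases...>
            ptype, short = fields[1], fields[2]
            prop_types[short] = ptype
            for alias in fields[2:]:
                prop_aliases[alias] = short
        elif kind == "binary":
            # binary;<short value>;<more aliases...>, shared by all binary properties
            for alias in fields[1:]:
                binary_aliases[alias] = fields[1]
        elif kind == "value":
            # value;<property short name>;<short value>;<more aliases...>
            pname, short = fields[1], fields[2]
            for alias in fields[2:]:
                value_aliases[(pname, alias)] = short
    return prop_types, prop_aliases, binary_aliases, value_aliases


tables = build_alias_tables([
    "property;Binary;Alpha;Alphabetic",
    "property;Enumerated;bc;Bidi_Class",
    "binary;N;No;F;False",
    "binary;Y;Yes;T;True",
    "value;bc;ON;Other_Neutral",
])
prop_types, prop_aliases, binary_aliases, value_aliases = tables
```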

defaults;0000..10FFFF;age=NA;bc=L;blk=NB;bpt=n;cf=<code point>;dm=<code point>;dt=None;ea=N;FC_NFKC=<code point>;gc=Cn;GCB=XX;gcm=Cn;hst=NA;InPC=NA;InSC=Other;jg=No_Joining_Group;jt=U;lb=XX;lc=<code point>;NFC_QC=Y;NFD_QC=Y;NFKC_CF=<code point>;NFKC_QC=Y;NFKD_QC=Y;nt=None;SB=XX;sc=Zzzz;scf=<code point>;scx=<script>;slc=<code point>;stc=<code point>;suc=<code point>;tc=<code point>;uc=<code point>;vo=R;WB=XX

After the version, property, and property value lines, and before other data lines, the defaults line defines default values for all code points (corresponding to @missing data in the UCD). Any properties not mentioned here default to null values according to their type, such as False or the empty string.

The general syntax of this line is the same as for the following data lines:

Line type keyword.
Code point or start..end range (inclusive end).
Zero or more property values.
- Binary values are given by their property name alone if True (“Alpha”), or with a minus sign prepended if False (“-Alpha”).
- Other values are given as “pname=value” pairs, where pname is the property name.
- In the ppucd.txt file, short names of properties and values are used, but parsers should be prepared to accept any of the aliases according to the earlier sections of the file.
- In the ppucd.txt file, properties are listed in sorted order, but this is not required by the syntax.
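
The general line syntax above can be sketched in a few lines of Python. The names parse_range and parse_data_line are hypothetical; the sketch keeps property names and values exactly as written and does not map aliases to short names, which a full parser would do using the alias sections of the file.

```python
def parse_range(field):
    """Parse "20001" or "20000..2A6DF" into an inclusive (start, end) pair."""
    start_text, sep, end_text = field.partition("..")
    start = int(start_text, 16)
    end = int(end_text, 16) if sep else start
    return start, end


def parse_data_line(line):
    """Split one data line into (line type, start, end, property dict)."""
    fields = line.split(";")
    line_type = fields[0]
    start, end = parse_range(fields[1])
    props = {}
    for field in fields[2:]:
        name, sep, value = field.partition("=")
        if sep:
            props[name] = value        # pname=value pair
        elif name.startswith("-"):
            props[name[1:]] = False    # binary property, explicitly False
        else:
            props[name] = True         # binary property, True
    return line_type, start, end, props


parsed = parse_data_line("unassigned;2A6D7..2A6DF;ea=W;lb=ID;vo=U")
print(parsed)
```

Because any order of properties is accepted, the result is an unordered mapping; the sorted order in ppucd.txt matters only for the generated file.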

block;20000..2A6DF;age=3.1;Alpha;blk=CJK_Ext_B;ea=W;gc=Lo;Gr_Base;IDC;Ideo;IDS;lb=ID;SB=LE;sc=Hani;UIdeo;vo=U;XIDC;XIDS
# 20000..2A6D6 CJK Unified Ideographs Extension B
algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-
cp;20001;nt=Nu;nv=7
cp;20064;nt=Nu;nv=4
unassigned;2A6D7..2A6DF;ea=W;lb=ID;vo=U
# No block
unassigned;2A6E0..2A6FF;ea=W;lb=ID;vo=U
algnamesrange;AC00..D7A3;hangul

Block lines specify a Unicode Block and provide an opportunity for compact data lines for ranges inside the block, by listing common property values once for the whole block. Block properties override the defaults for cp and unassigned lines with code point ranges inside the block. The file syntax and parser do not require the presence of block lines.

cp lines provide the data for a code point or range. They override the default+block properties. Properties that are not mentioned fall back to the block, then to the defaults.

Unassigned lines (new in ICU 60 for Unicode 10) provide the data for an unassigned code point or range (gc=Cn). They override only the default properties and do not inherit block properties, with one exception: if the range is inside a block, the blk=Block property of that block still applies. Properties that are not mentioned fall back to the defaults.

A range is considered inside a block if it is fully inside the range of the last defined block. Otherwise it is considered outside a block and falls back only to the defaults. This is the case even if the range is inside an earlier block, to simplify parsing & processing (such data lines should be avoided).

A range inside the block for which there is no data line inherits all of the default+block properties (see Han blocks). Note that this is very different from the behavior of an unassigned line, in particular since such blocks typically default to gc!=Cn.
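
The fallback rules for cp and unassigned lines could be expressed as in the sketch below. It is not the ICU implementation: effective_props is a hypothetical helper, properties are plain dicts, and the defaults and block values are a small subset of the sample lines above.

```python
def effective_props(line_type, start, end, props, defaults, block):
    """Resolve the properties of one cp or unassigned line.

    block is (block_start, block_end, block_props) for the last defined
    block, or None if no block line has been seen yet.
    """
    result = dict(defaults)
    # "Inside" means fully inside the range of the last defined block.
    inside = block is not None and block[0] <= start and end <= block[1]
    if inside:
        block_props = block[2]
        if line_type == "cp":
            result.update(block_props)          # cp: defaults, then block
        elif line_type == "unassigned" and "blk" in block_props:
            result["blk"] = block_props["blk"]  # unassigned: only blk from the block
    result.update(props)                        # the line's own values win
    return result


defaults = {"age": "NA", "blk": "NB", "ea": "N", "gc": "Cn",
            "lb": "XX", "nt": "None", "vo": "R"}
block = (0x20000, 0x2A6DF, {"age": "3.1", "Alpha": True, "blk": "CJK_Ext_B",
                            "ea": "W", "gc": "Lo", "lb": "ID", "vo": "U"})

cp = effective_props("cp", 0x20001, 0x20001, {"nt": "Nu", "nv": "7"}, defaults, block)
un_in = effective_props("unassigned", 0x2A6D7, 0x2A6DF,
                        {"ea": "W", "lb": "ID", "vo": "U"}, defaults, block)
un_out = effective_props("unassigned", 0x2A6E0, 0x2A6FF,
                         {"ea": "W", "lb": "ID", "vo": "U"}, defaults, block)
no_line = effective_props("cp", 0x20002, 0x20002, {}, defaults, block)
```

Here cp and no_line pick up gc=Lo from the block, un_in keeps gc=Cn but still gets blk=CJK_Ext_B, and un_out falls back to the defaults alone. A range without any data line is modeled as a cp line with no properties of its own.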

Non-default properties for unassigned ranges inside and outside of blocks are typically for complex defaults and for noncharacters.

ppucd.txt data lines are in code point order, although this should not be strictly required.

Assigned characters normally have their unique na=Name property value. For Hangul syllables with their algorithmically computed names, the entire range is covered by the line “algnamesrange;AC00..D7A3;hangul”. For ranges of ideographic characters, a line like “algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-” provides a Name prefix which is to be followed by the code point (in hex like %04lX).
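
For the prefix-based ranges, name generation might look like the following sketch. The helper names are hypothetical, and only the han type is handled; the Hangul syllable name algorithm is deliberately left out here.

```python
def parse_algnamesrange(line):
    """Parse an algnamesrange line into (start, end, kind, prefix)."""
    fields = line.split(";")
    if fields[0] != "algnamesrange":
        raise ValueError("not an algnamesrange line: " + line)
    start_text, sep, end_text = fields[1].partition("..")
    start = int(start_text, 16)
    end = int(end_text, 16) if sep else start
    kind = fields[2]
    prefix = fields[3] if len(fields) > 3 else ""  # hangul lines carry no prefix
    return start, end, kind, prefix


def algorithmic_name(name_range, code_point):
    """Return the algorithmic name, or None if the code point is out of range."""
    start, end, kind, prefix = name_range
    if not start <= code_point <= end:
        return None
    if kind == "han":
        # Prefix followed by the code point in uppercase hex, at least 4 digits.
        return prefix + "%04X" % code_point
    raise NotImplementedError("name type not covered by this sketch: " + kind)


han = parse_algnamesrange("algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-")
name = algorithmic_name(han, 0x20001)
print(name)
```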

Why not UCD .txt files?

See UAX #44 “Unicode Character Database”

Nontrivial parsing:

The UCD has grown from a couple of semicolon-delimited files plus an informative “Property dump” (early PropList.txt) to a collection of dozens of files with a variety of (now more regular) formats.
Related properties are scattered over several files.
Full information for Numeric_Value and Numeric_Type requires parsing two files.
Default values are “hidden” in comments.
The UCD folder structure (which file where) has changed over time.
UCD filenames change during each Unicode beta period. (A detailed version number is inserted into each filename.)
Many files are bloated with comments that show the General Category and name of each character or range start/end; if the data were combined into a single file, then all properties for a character or range would be listed together, without need for such comments.

Nontrivial patching: Adding characters (e.g., PUA or proposed/draft) requires adding data in many of the UCD files.

ICU already preprocesses some of the UCD .txt files. We strip comments from some files (because they are huge) and in some files merge adjacent same-property code points into ranges.
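
Merging adjacent code points that share a property value into ranges can be sketched as below. This illustrates the idea only and is not the code of the existing ICU scripts; the function name and the (code point, value) input shape are assumptions.

```python
def merge_into_ranges(pairs):
    """Merge (code point, value) pairs, sorted by code point, into ranges.

    Returns a list of (start, end, value) with inclusive ends.
    """
    ranges = []
    for code_point, value in pairs:
        if ranges and ranges[-1][1] + 1 == code_point and ranges[-1][2] == value:
            start, _, _ = ranges[-1]
            ranges[-1] = (start, code_point, value)  # extend the current range
        else:
            ranges.append((code_point, code_point, value))
    return ranges


merged = merge_into_ranges([
    (0x41, "L"), (0x42, "L"), (0x43, "L"),  # adjacent, same value: one range
    (0x45, "L"),                            # gap at 0x44 starts a new range
    (0x46, "R"),                            # adjacent but different value
])
print(merged)
```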

Some changes are manual, such as updating and adding ranges of algorithmic character names.

Then we run several tools, most of them twice, to parse different sets of .txt files and write several output files. We use several Python and shell scripts, and a “log” (unidata/changes.txt) with details of what was changed and run in each Unicode version upgrade.

Markus has done ICU Unicode updates since about 2002. Someone else might have a hard time picking this up for maintenance and future Unicode version updates.

Why not UCD XML files?

See UAX #42 “Unicode Character Database in XML”

Good: The UCD XML file format stores all properties in a single file with a relatively simple structure, with property values as XML attributes.

Issues:

Missing data which is needed for ICU
- Name_Alias added in UCD 5.0 but missing in UCD XML as of UCD 6.1 beta.
- Script_Extensions added in UCD 6.0 but not “blessed” as a Unicode property as of UCD 6.1. Useful, used in ICU, but not available in UCD XML.
- Adopting UCD XML would require either still parsing some UCD .txt files or writing another tool to merge more data into the XML.
Dependency on third party
- Lag time between UCD .txt vs. XML availability during beta.
- Unable to fix/update/extend XML generator tools.
- For new properties, need to wait for standardization (UAX #42), tool update, and XML publication.
- Will not support custom/nonstandard data.
Could be simpler: Parsing XML is easy in Java, Python, etc. and doable in C++ (we have a “poor man’s” XML parser), but not as easy as line.split(";").
- There is no need for complex structure for the UCD.
Could be easier to read for humans: Defaults for all of Unicode are not stored in one place; instead each <group> carries them, making it hard to see which values are specific to each group. “Fluffy” XML makes for longer text lines and more horizontal scrolling.
Hard to diff: The XML format can be used in different ways, and Unicode publishes different forms of the same data. Also, the precise XML text depends on the XML formatting code used.
- For diffing, a special tool needs to be run to parse old & new XML data, compare values, and generate a diff report. Unicode publishes some of those too.
Some data still requires nontrivial parsing.
- For algorithmic character names, the range needs to be determined by collecting a contiguous sequence of elements with a shared name pattern. There is not even any special notation for the algorithmic names for Hangul syllables.
Minor: Unnecessary data (for ICU)
- Precomputed Hangul syllable names
- Irrelevant contributory properties like “Other_Xyz”
- Properties not used by ICU
Minor, just awkward: Blocks are treated as auxiliary data, rather than as a core means to organize and store the data. On the other hand, the “grouped” XML files also use them as the basis for the <group> elements and associated compaction. (The “flat” files don’t.)

Goals

Single file with all data relevant for ICU.
Very easy to parse and use the data in C/C++ tools.
Easily human readable.
Easy-to-read diffs from standard diff tools.
Compact file format.
Conversion tool easy to write, maintain, extend.
Convert from UCD .txt files because those are maintained directly by the UTC & editorial committee. No waiting for third party to convert the files.
Able to extend for new kinds of data.
Easy format for manual data fixes/additions (e.g., PUA or proposed/draft).
Move much of the parsing from scattered C code into one Python script.

Details

All-Unicode defaults in one place, but only list non-null default values. (blk=No_Block, cf=<code point>, ...)
Line-oriented, always semicolon-separated, with type-of-line in the first field.
Block properties override defaults, but only for the few properties whose values are common to the block and differ from the defaults.
- Effective because blocks represent actual allocation & organization of Unicode. Maintained by UTC.
Code point/range properties override default+block properties.
Algorithmic names stored as ranges with type & shared name prefixes (for CJK).
No gratuitous white space or syntax characters.
Mostly key=value, simpler format for binary properties. Easy to read.
Comment lines with headings from NamesList.txt further improve readability. (There are few of them, so no significant size bloat.)
Simple, stable file generation allows diffing.
- E.g., list properties in sorted order of property names.
No need to implement/store properties that are not used in ICU. (But format & tool are easy to extend.)
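
To illustrate how stable, diff-friendly output can fall out of these choices, here is a hedged sketch of a line formatter. The name format_data_line is hypothetical, and the case-insensitive sort key is an assumption inferred from the property order in the sample block line, not taken from preparseucd.py.

```python
def format_data_line(line_type, start, end, props):
    """Format one data line with properties in a fixed, sorted order."""
    if start == end:
        range_text = "%04X" % start
    else:
        range_text = "%04X..%04X" % (start, end)
    fields = [line_type, range_text]
    # Assumption: sort property names case-insensitively.
    for name in sorted(props, key=str.lower):
        value = props[name]
        if value is True:
            fields.append(name)              # binary True: name alone
        elif value is False:
            fields.append("-" + name)        # binary False: minus sign prepended
        else:
            fields.append("%s=%s" % (name, value))
    return ";".join(fields)


block_line = format_data_line("block", 0x20000, 0x2A6DF, {
    "XIDS": True, "gc": "Lo", "Alpha": True, "age": "3.1", "blk": "CJK_Ext_B",
    "ea": "W", "Gr_Base": True, "IDC": True, "Ideo": True, "IDS": True,
    "lb": "ID", "SB": "LE", "sc": "Hani", "UIdeo": True, "vo": "U", "XIDC": True,
})
print(block_line)
```

Since the output does not depend on the insertion order of the dict, regenerating the file from unchanged data yields identical text, so a standard diff shows only real changes.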

Plan

(done) Write Python tool to preparse UCD .txt files and generate one output ppucd.txt file.
(done) Subsume existing ucdcopy.py.
(done) Write toolutil C++ parser for ppucd.txt, add ppucd.txt to the unidata folder.
(done) Merge genbidi, gencase, gennames, gennorm into genprops
- Replace scattered many-.txt parsers with calls to the toolutil ppucd.txt parser.
- Generate all output files in one genprops invocation.
- Update makeprops.sh (delete half of it) & changes.txt.
(done) Make preparseucd.py also parse uchar.h & uscript.h and write the property names data header file. (was: ~~Change genpname/preparse.pl to read ppucd.txt rather than Property[Value]Aliases.txt.~~)
(done) Consider changing pnames_data.h so that minor changes don’t change most of the file contents.
(done) Write wiki/Markus/ReviewTicket8972 with diff links.
- 2019-sep-27: The old Trac server is going away. I copied the wiki page contents into a comment on ICU-8972.
Move UCD tests from cintltst to intltest, change to use the toolutil ppucd.txt parser. (ticket #9041)
Change Java UCD tests to parse & use ppucd.txt. (ticket #9041)
(partially done) Change Python preparser to not copy input UCD .txt files any more, delete them from unidata & Java. (ticket #9041)

Other tool improvements

Bad: Up to and including ICU 4.8, the process was

build & install ICU -> build Unicode tools -> run genpname -> build & install ICU (now with updated property names) -> build Unicode tools -> run UCD parsers -> build & install ICU (now also with case properties & normalization etc.) -> build Unicode tools -> run genuca -> build & install ICU

It should be possible to

merge the Unicode tools into one binary
parameterize the relevant properties code (property name lookup, case & some other properties, NFC)
inject newly built data into the common library for the next part of the merged Unicode tool’s processing.

ICU 49:

build & install ICU -> build Unicode tools -> run genprops -> build & install ICU (now with updated properties) -> build Unicode tools -> run genuca -> build & install ICU

genprops builds the property (value) names data and injects it into the live ppucd.txt parser for further processing.

Goal:

build & install ICU -> build Unicode tool -> run it -> build & install ICU (now with all updated Unicode data)

Requires ticket #9040, could be “hard”.