This file provides instructions for building and running the UnicodeTools, which can be used to:
[!CAUTION]
- This is NOT production level code, and should never be used in programs.
- The API is subject to change without notice, and will not be maintained.
- The source is uncommented, and has many warts; since it is not production code, it has not been worth the time to clean it up.
- It will probably need some adjustments on Unix or Windows, such as changing the file separator.
- Currently it uses hard-coded directory names.
- The contents of multiple versions of the UCD must be copied to a local directory, as described below.
- It will be useful to look at the history of the files in git to see the kinds of rule changes that are made!
- Unfortunately, we lost some change history of about 1.5 years(?) leading up to April 2020.
Some of the tasks within the Unicode Tools generate output files that can also be input files into other steps.
For this purpose, we create a folder named Generated to store these files.
This folder can be a subfolder inside the local working copy root (called an “In-source build” workspace layout), or it can be outside the local working copy root, for example as a sibling folder (called an “Out-of-source build” workspace layout). Both workspace styles are described below.
Out-of-source builds keep a separation between source files of the repository and their generated output files, which are not tracked in the repository. Out-of-source builds allow developers to maintain a clean view of changes to tracked source files, without mixing generated output files. (Out-of-source builds are also useful for C++ repositories in which multiple configurations can be invoked to generate independent sets of makefiles that result in corresponding different output compiled binary files.)
mkdir -p unicodetools/mine/src
mkdir -p cldr/mine/src
git clone https://github.com/unicode-org/unicodetools.git unicodetools/mine/src
git clone https://github.com/unicode-org/cldr.git cldr/mine/src
Generated folder structure as a sibling to the local working copy root:
mkdir -p unicodetools/mine/Generated/BIN
git clone https://github.com/unicode-org/unicodetools.git
git clone https://github.com/unicode-org/cldr.git
unicodetools local working copy, create the Generated/BIN folder structure
cd unicodetools; mkdir -p Generated/BIN

Currently, some tests run on the generated output files of a tool (ex: in order to test the validity of the output files). After converting these tests into standard JUnit tests, these unit tests are then run in isolation by default. Our code has been updated to support this behavior because it now checks for generated files in the Generated directory, and falls back to the repository’s checked-in version when a command does not invoke the generation of a new version.
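The fallback described above can be pictured with a minimal sketch. This is not the actual Unicode Tools code: the class `GeneratedFileLocator` and its method are hypothetical names, and the real lookup may apply additional rules.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not part of the Unicode Tools API. */
final class GeneratedFileLocator {
    private GeneratedFileLocator() {}

    /**
     * Returns genDir/relativePath when that file exists,
     * otherwise the checked-in copy at repoDir/relativePath.
     */
    static Path resolve(Path genDir, Path repoDir, String relativePath) {
        Path generated = genDir.resolve(relativePath);
        return Files.isRegularFile(generated) ? generated : repoDir.resolve(relativePath);
    }
}
```

With this shape, a test that runs in isolation reads the checked-in file, while a test that runs after a generation step picks up the freshly generated one.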
(Note: The following example values for Java system properties are paths to local working copies that are organized using the out-of-source build workspace layout, as described above.)
| Property | Example Value |
|---|---|
| CLDR_DIR | /usr/local/google/home/mscherer/cldr/mine/src |
| IMAGES_REPO_DIR | /usr/local/google/home/mscherer/images/mine/src |
| UNICODETOOLS_REPO_DIR | /usr/local/google/home/mscherer/unitools/mine/src |
| UNICODETOOLS_GEN_DIR | /usr/local/google/home/mscherer/unitools/mine/Generated |
| UVERSION | 14.0.0 |
If you have not correctly configured your local settings file according to http://cldr.unicode.org/development/maven, then you will likely see error messages when your Maven build tries to access (including just reading) the Maven dependency artifacts such as:
[ERROR] Failed to execute goal on project unicodetools-testutils: Could not collect dependencies for project org.unicode.unicodetools:unicodetools-testutils:jar:1.0.0
[ERROR] Failed to read artifact descriptor for com.ibm.icu:icu4j:jar:78.0.1-SNAPSHOT
[ERROR] Caused by: The following artifacts could not be resolved: com.ibm.icu:icu4j:pom:78.0.1-20250916.173842-8 (present, but unavailable): Could not transfer artifact com.ibm.icu:icu4j:pom:78.0.1-20250916.173842-8 from/to github (https://maven.pkg.github.com/unicode-org/cldr): authentication failed for https://maven.pkg.github.com/unicode-org/cldr/com/ibm/icu/icu4j/78.0.1-SNAPSHOT/icu4j-78.0.1-20250916.173842-8.pom, status: 401 Unauthorized
[ERROR] Failed to read artifact descriptor for org.unicode.cldr:cldr-code:jar:0.0.0-SNAPSHOT-3404124632
[ERROR] Caused by: The following artifacts could not be resolved: org.unicode.cldr:cldr-code:pom:0.0.0-SNAPSHOT-3404124632 (present, but unavailable): Could not transfer artifact org.unicode.cldr:cldr-code:pom:0.0.0-SNAPSHOT-3404124632 from/to github (https://maven.pkg.github.com/unicode-org/cldr): authentication failed for https://maven.pkg.github.com/unicode-org/cldr/org/unicode/cldr/cldr-code/0.0.0-SNAPSHOT-3404124632/cldr-code-0.0.0-SNAPSHOT-3404124632.pom, status: 401 Unauthorized
This happens because GitHub’s Maven artifact registry requires authentication, via a GitHub token, even when you are only reading artifacts.
[!IMPORTANT] If you run into this issue, then:
- Revisit the CLDR Maven setup link to ensure that you have created the correct type of GitHub access token (classic, not fine-grained; with a fine-grained token, you will get a 403 rather than a 401) and that you have granted the correct permissions to it.
- If you are copying and pasting any of the example Unicode Tools task Maven commands from the instructions below and/or from the CI workflow files, then you must remove the `-s .github/workflows/mvn-settings.xml` option, which is needed only for CI. Removing the `-s` option will use the settings in your default local settings file at `~/.m2/settings.xml`.
Like other projects, Unicode Tools uses a source formatter to ensure a consistent code style automatically, and it uses a single common formatter to avoid spurious diff noise in code reviews. This is now enforced via a formatter that is configured in the Maven build via a Maven plugin and checked by continuous integration on pull requests.
When creating pull requests, you can check the formatting locally using the command mvn spotless:check. You can apply the formatter’s changes using the command mvn spotless:apply. Continuous integration errors for formatting can be fixed by committing the changes resulting from applying the formatter locally and pushing the new commit.
Some IDEs can integrate the formatter via plugins, which can minimize the need to manually run the formatter separately. The following links for specific IDEs may work:
- … `android-formatting.xml`.
- … `android-formatting.xml` link mentioned for Eclipse (ex: `"java.format.settings.url": "https://raw.githubusercontent.com/aosp-mirror/platform_development/master/ide/eclipse/android-formatting.xml",`). Also use the profile name corresponding to that XML file (ex: `"java.format.settings.profile": "Android",`).
- If `Maven` and `Existing Maven Projects` don’t appear as a top-level category and sub-option in the initial Import screen of the wizard, then the Eclipse plugin for Maven support has not been installed yet; see above.
- … `Build and Test`
- … `${workspace_loc:/unicodetools-parent}`
- … `package`
- … `-ea`)

For the tools to work, you need to set the JVM system properties according to your workspace layout. Depending on which tool you are running, you may need some or all of the properties listed above in General Setup for Maven.
For command-line users:
- … `-Dvar1=path1 -Dvar2=path2 ...`

For Eclipse users:
- … `-Dvar1=path1 -Dvar2=path2 ...`.
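For orientation, values passed as `-Dname=value` become JVM system properties that Java code reads with `System.getProperty`. The sketch below shows that mechanism only; the class `ToolProperties` and its methods are hypothetical and are not how the tools themselves organize their settings.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; not part of the Unicode Tools API. */
final class ToolProperties {
    private ToolProperties() {}

    /** Returns the directory named by a -Dname=path JVM option, or fails with a clear message. */
    static Path requiredDir(String name) {
        String value = System.getProperty(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException(
                    "Missing JVM system property; pass -D" + name + "=<path>");
        }
        return Paths.get(value);
    }

    /** Returns the property value, or the given default when the property is unset. */
    static String valueOrDefault(String name, String defaultValue) {
        return System.getProperty(name, defaultValue);
    }
}
```

For example, after launching with `-DUNICODETOOLS_GEN_DIR=/some/path`, the call `ToolProperties.requiredDir("UNICODETOOLS_GEN_DIR")` would return that path, and a missing property would fail early with a message naming the flag to add.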
Please also enable assertions when running commands so that failed assertions don’t just slip through.
Command-line users:
- Set the `MAVEN_OPTS` environment variable to include the `-ea` JVM option in its string value:
  - `export MAVEN_OPTS="-ea"; mvn compile exec:java -Dexec.mainClass=...`
  - `MAVEN_OPTS="-ea" mvn compile exec:java -Dexec.mainClass=...`

Eclipse users:
- Set `-ea` (enable assertions) in your Preferences or in your Run/Debug configurations.

All commands must be run in the root of the unicodetools repository local working copy directory.
Common tasks for Unicode Tools are listed below with example CLI commands with example argument values that they need:
Out-of-source build: mvn -s .github/workflows/mvn-settings.xml compile exec:java -Dexec.mainClass="org.unicode.text.UCD.Main" -Dexec.args="version 14.0.0 build MakeUnicodeFiles" -am -pl unicodetools -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=14.0.0
In-source build: MAVEN_OPTS="-ea" mvn compile exec:java -Dexec.mainClass="org.unicode.text.UCD.Main" -Dexec.args="version 14.0.0 build MakeUnicodeFiles" -am -pl unicodetools -DCLDR_DIR=$(cd ../cldr ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd Generated; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=14.0.0
Out-of-source build: MAVEN_OPTS="-ea" mvn package -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=14.0.0
In-source build: MAVEN_OPTS="-ea" mvn package -DCLDR_DIR=$(cd ../cldr ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd Generated; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=14.0.0
See the corresponding GitHub Actions continuous integration workflow file for other commonly used tools and the specifics of invoking them at the command line.
For each individual command in Unicode Tools described above, you can configure a Launch Configuration in one of two ways.
- … `UCD Make Unicode Files`)
- … `-am -pl unicodetools compile exec:java` (the argument for the subproject list flag `-pl` assumes that the class with the main method is in the subdirectory `unicodetools/src/main/java`)
- … name = `exec.mainClass`, value = `"org.unicode.text.UCD.Main"`; name = `exec.args`, value = `"version 15.0.0 build MakeUnicodeFiles"`)

- … `UCD Make Unicode Files`)
- … `unicodetools`
- … `org.unicode.text.UCD.Main`)
- … `version 15.0.0 build MakeUnicodeFiles`)
- … `-ea`)
- If you see `Error: Could not find or load main class org.unicode.text.UCD.Main Caused by: java.lang.ClassNotFoundException ...`, then you must run the `Build and Test` run config for Maven to build the yet-uncompiled Java classes into `./unicodetools/target/classes`.

:point_right: Note: This is a mess. See https://unicode-org.atlassian.net/browse/ICU-21757
See the top level pom.xml under <properties>.
- … `icu.version` in the top level pom.xml to the version string, such as `70.0.1-SNAPSHOT-cldr-2021-09-15`
- … `cldr.version` in the top level pom.xml to this version string, which has `0.0.0` and a git hash in it, such as `0.0.0-SNAPSHOT-bfa39570be`
- … `pom.xml` … `40.0-SNAPSHOT`
- … `cldr.version` to `40.0-SNAPSHOT` and this version will be used.

The input data files for the Unicode Tools are checked into the repo since 2012-dec-21:
This is inside the unicodetools file tree, and the Java code has been updated to assume that. Any old Eclipse setup needs its path variables checked.
For details see Input data setup.
To generate new data files, you can run the org.unicode.text.UCD.Main class
(yes, the Main class has a main() function)
with program arguments build MakeUnicodeFiles. You may optionally include e.g.
version 14.0.0 if you wish to just generate the files for a single version.
Make sure you have the VM arguments set up as described above.
Starting with Unicode 15, we are developing most of the Unicode data files in this Unicode Tools project, and publish them to the Public folder only for alpha/beta/final releases. That is, we are reversing the flow of files.
See data workflow. (Based on issue #144.)
We are also no longer generating and posting files with version suffixes. (We now generate files into an output folder with the Unicode version number.)
Except: Some files, such as Unihan and ucdxml data files, are developed elsewhere, and we continue to ingest them as before.
Starting with Unicode 15, we keep the latest versions of data files in unversioned “dev” folders in this repo.
See data workflow.
All of the following have version 15.0.0 (or whatever the latest version is)
in the options given to Java.
Example changes for adding Unicode 15 version numbers:
See the second commit of https://github.com/unicode-org/unicodetools/pull/156. Also, you must update the version number in the CI build scripts in .github/workflows/.
Example changes for adding properties: https://github.com/unicode-org/unicodetools/pull/40.

Throughout these steps we will walk through updating unicodetools to support Unicode 15 or 14.
First, fetch the latest data files for this version from https://www.unicode.org/Public/14.0.0/ucd/ (adjust the URL to match your new version number). If that folder does not exist, ask ucd-dev@googlegroups.com to create it. You may also need to fetch the emoji files from https://www.unicode.org/Public/emoji/13.1, using a previous version if a new one does not exist.
You may need to use the tools from Input data setup to
desuffix the files (removing the -dN suffixes) to set up the inputs correctly. Copy these into
unicodetools/data/emoji/14.0 and unicodetools/data/ucd/14.0.0-Update.
For some updates you may need to pull in other files (uca, security, idna, etc.); see Input data setup for more information.
Now, update the following files:
MakeUnicodeFiles.txt (find in Eclipse via Navigate/Resource or Ctrl+Shift+R)
Generate: .*
CopyrightYear: 2021 (or whatever)
....
File: DerivedAge
..... add a value for the latest version at the bottom:
Value: V14_0
Update String[] LONG_AGE and String[] SHORT_AGE in UCD_Names.java.
Update latestVersion and lastVersion in org.unicode.text.utility.Settings.java to fix:
public static final String latestVersion = "14.0.0";
public static final String lastVersion = "13.1.0"; // last released version
Update LIMIT_AGE and AGE_VERSIONS in UCD_Types.java.
Update enum AGE_Values in UcdPropertyValues.java.
Update searchPath in org.unicode.text.utility.Utility.java.
If there are new CJK characters
(if there are changes to entries in UnicodeData.txt that are for <CJK Ideograph ..., First> etc.),
UCD.java and UCD_Types.java need to be updated to handle these ranges.
See PR #171
and PR #47 for examples.
For CJK, you’ll first need to compute the composite version, as (major << 16) | (minor << 8) | update.
E.g. Unicode 14 is 0xe0000.
Since the ranges change based on the version, the code here needs to be updated in a version-aware way.
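As a quick check of the arithmetic, the composite version formula can be written out as follows. The class and method names are hypothetical and chosen only for this illustration; the tools compute the value in their own code.

```java
/** Illustration of the composite version formula; names are hypothetical. */
final class CompositeVersion {
    private CompositeVersion() {}

    /** Packs major.minor.update into one int: (major << 16) | (minor << 8) | update. */
    static int of(int major, int minor, int update) {
        return (major << 16) | (minor << 8) | update;
    }

    /** Parses a dotted version string such as "14.0.0" into its composite value. */
    static int parse(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected major.minor.update: " + version);
        }
        return of(
                Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]));
    }
}
```

`CompositeVersion.of(14, 0, 0)` yields `0xE0000`, matching the value given above, and `CompositeVersion.parse("13.1.0")` yields `0xD0100`. Because the major version occupies the highest bits, composite values compare in release order.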
If any range has changed its end point, say, CJK Extension C, update CJK_C_LIMIT in UCD_Types.java
(make sure to update the comment next to it with the latest Unicode version).
Then edit mapToRepresentative() in UCD.java to add the range. Make sure the
range is added only for the latest Unicode version, by using sections like
if (ch <= 0x2B737 && rCompositeVersion >= 0xe0000).
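The version-gated pattern can be sketched in isolation as below. This is a simplified stand-in, not the real `mapToRepresentative()` in UCD.java: the class name is hypothetical, `OLD_END` is a placeholder for the end point before the change (an assumption, check the actual constant), and only one range is handled.

```java
/** Simplified sketch of a version-aware range check; not the real UCD.java code. */
final class VersionGatedRange {
    private VersionGatedRange() {}

    static final int CJK_C_BASE = 0x2A700; // start of Extension C, as noted in this document
    static final int OLD_END = 0x2B734;    // placeholder for the earlier end point (assumption)
    static final int NEW_END = 0x2B737;    // end point from the example condition above

    /** Returns the first code point of the range containing ch, or ch itself if outside. */
    static int mapToRepresentative(int ch, int rCompositeVersion) {
        if (ch >= CJK_C_BASE) {
            if (ch <= OLD_END) {
                return CJK_C_BASE; // in the range for every supported version
            }
            if (ch <= NEW_END && rCompositeVersion >= 0xe0000) {
                return CJK_C_BASE; // extended part, only from Unicode 14 on
            }
        }
        return ch;
    }
}
```

The point of the second condition is that a code point in the extended part is treated as part of the range only when the requested version is new enough, so data for older versions keeps generating as before.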
If a new range has been introduced, add it to UCD_Types.java near CJK_E_BASE,
add it to mapToRepresentative(), update hasComputableName and get() in UCD.java
to add the first character.
Also search (case-insensitively) unicodetools for 2A700 (start of Extension C) and add the new range accordingly.
When CJK_LIMIT moves, search for 9FCC and update near there as necessary.
If the main Tangut block has been extended, then in UCD.java
mapToRepresentative() add another per-version block for returning TANGUT_BASE.
You can now run the steps in “Generating new data” above to attempt to generate the files. This will likely fail with errors due to missing enum values for new blocks and scripts.
Compare Blocks.txt to the old version (or check the errors from your attempt to generate new files). For all the new ones:
- … `ShortBlockNames.txt` (you need to know what the short name is; you can find it in `PropertyValueAliases.txt`)
- … `UcdPropertyValues.java` enum `Block_Values`
- … `ShortBlockNames` and see if you still get errors.
- … `UcdPropertyValues.java` enum `Script_Values`, in alphabetical order
- … `UCD_Types.java` below `SCRIPT_CODE`, in alphabetical order grouped by Unicode version. Update `LIMIT_SCRIPT` to use the name of the new last script
- … `SCRIPT` and `LONG_SCRIPT` in `UCD_Names.java`, in alphabetical order grouped by Unicode version. (Important: this must be in the same order as the previous one.)
- … `DerivedAge.txt` lines for the new version, copy them into the input `Scripts.txt` file, and change the new version number to the appropriate script (which can be new or old or Common etc.). Then run UCD Main again and check the generated `Scripts.txt`.

Make a pull request to incorporate these updates, and upload the generated files in a way that can be shared with ucd-dev.
Unicode 15+:
… instead of posting draft files elsewhere and re-ingesting them later.
Ideally, diff the files to check for any discrepancies. The script does this
automatically: search the output for lines that say “Found difference in
<filename>”. Note, however, that it displays only the first line of the diff,
so you may miss any additional discrepancies.
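If you want to see every differing line rather than only the first, a small standalone comparison along these lines may help. It is a sketch with hypothetical names, and it compares lines by position only, so it is not a true diff: one inserted line makes every later line show up as different.

```java
import java.util.ArrayList;
import java.util.List;

/** Position-by-position line comparison; hypothetical helper, not part of the tools. */
final class LineDiff {
    private LineDiff() {}

    /** Returns one message for each line number at which the two inputs differ. */
    static List<String> differences(List<String> expected, List<String> actual) {
        List<String> result = new ArrayList<>();
        int max = Math.max(expected.size(), actual.size());
        for (int i = 0; i < max; i++) {
            String e = i < expected.size() ? expected.get(i) : "<missing>";
            String a = i < actual.size() ? actual.get(i) : "<missing>";
            if (!e.equals(a)) {
                result.add("line " + (i + 1) + ": " + e + " != " + a);
            }
        }
        return result;
    }
}
```

The two inputs could be read with `java.nio.file.Files.readAllLines` from the old and the newly generated file. For anything beyond a quick look, a regular diff tool is the better choice.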
When you run, it will break if there are new enum property values.
Note: For more information and newer code see the pages
To fix that:
Go into org.unicode.text.UCD/
UCD_Names.java and UCD_Types.java. (These contain ugly items that should be enums nowadays.)
Find the property (easiest is to search for some other properties in the enum).
Add at end in UCD_Types. Be sure to update the limit, like
LIMIT_SCRIPT = Mandaic + 1;
Then in UCD_Names, change the corresponding name entry, both the full and abbreviated names.
Follow the format of the existing values.
For example:
In UCD_Names.java in BIDI_CLASS add "LRI", "RLI", "FSI", "PDI",
In UCD_Names.java in LONG_BIDI_CLASS add
"LeftToRightIsolate", "RightToLeftIsolate", "FirstStrongIsolate", "PopDirectionalIsolate",
In UCD_Types.java add & adjust
BIDI_LRI = 20,
BIDI_RLI = 21,
BIDI_FSI = 22,
BIDI_PDI = 23,
LIMIT_BIDI_CLASS = 24;
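Because the numeric constants, the short names, and the long names live in separate places, it is easy to update one and forget another. The following sketch shows the kind of consistency check that catches this. It is hypothetical and not part of the tools: the class `NameTableCheck` is an invented name, and the usage example below uses a toy table holding only the four new values.

```java
/** Hypothetical consistency check for parallel name tables; not part of the tools. */
final class NameTableCheck {
    private NameTableCheck() {}

    /**
     * Returns null when both tables have exactly `limit` non-empty entries,
     * otherwise a message describing the first problem found.
     */
    static String firstProblem(String[] shortNames, String[] longNames, int limit) {
        if (shortNames.length != limit) {
            return "short names: expected " + limit + " entries, found " + shortNames.length;
        }
        if (longNames.length != limit) {
            return "long names: expected " + limit + " entries, found " + longNames.length;
        }
        for (int i = 0; i < limit; i++) {
            if (shortNames[i] == null || shortNames[i].isEmpty()
                    || longNames[i] == null || longNames[i].isEmpty()) {
                return "missing name at index " + i;
            }
        }
        return null;
    }
}
```

For a toy table with `{"LRI", "RLI", "FSI", "PDI"}` as short names and the four long names above, a limit of 4 reports no problem, while a limit of 5 reports that the short name table is one entry short, which is what happens when the limit is bumped but the names are not added.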
Some changes may cause collisions in the UnicodeMaps used for derived properties. You’ll find that out with an exception like:
Exception in thread "main" java.lang.IllegalArgumentException: Attempt to reset value for 17B4 when that is disallowed. Old: Control; New: Extend at org.unicode.text.UCD.ToolUnicodePropertySource$28.<init>(ToolUnicodePropertySource.java:578)
Add new scripts like other new property values. In addition, make sure there are ISO 15924 script codes, and collect CLDR script metadata. See
http://cldr.unicode.org/development/updating-codes/updating-script-metadata
http://www.unicode.org/iso15924/codechanges.html
If there are new break rules (or changes), see Segmentation-Rules.
MakeUnicodeFiles.txt
This file drives the production of the derived Unicode files. The first three lines contain parameters that you may want to modify from time to time:
Generate: .*script.* // this is a regular expression. Use .* for all files
CopyrightYear: 2010 // Pick the current year
build MakeUnicodeFiles
version 6.3.0 or similar.

Writing UCD_Data
Data Size: 109,802
Wrote Data 109802
For each version, the tools build a set of binary data in BIN that contain the information for that release. This is done automatically, or you can manually do it with the Program Arguments
As options, use: version 5.0.0 build
This builds a compressed format of all the UCD data (except blocks and Unihan) into the BIN directory. Don’t worry about the voluminous console messages, unless one says “FAIL”.
You have to do this manually if you change any of the data files in that version! This ought to have build files, but I haven’t gotten around to it.
Note: if for any reason you modify the binary format of the BIN files, you also have to bump the value in that file:
static final byte BINARY_FORMAT = 8; // bumped if binary format of UCD changes
Diff_PropList-5.0.0d10.txt.bat
OLDER-Diff_PropList-5.0.0d10.txt.bat
UNCHANGED-Diff_PropertyValueAliases-5.0.0d10.txt.bat
Unicode 15+: See above; commit new input data, run tools, review output, copy back to input, commit, pull request…
We no longer post files to FTP folders, nor publish individual files without consistent changes in others.
Note: Also build and run the New Unicode Properties programs, since they have some additional checks.
{Generated}/UnicodeTestResults.txt
Options:
The console output shows whether any problems are found. Thus in the following case there was one failure:
ParseErrorCount=0
TestFailureCount=1
- … `TestTestUnicodeInvariants` JUnit wrapper
that runs `TestUnicodeInvariants` with default options,
and which is one of our CI build bot tests.

**** START Test Failure.
# Canonical decompositions (minus exclusions) must be identical across releases
[$Decomposition_Type:Canonical - $Full_Composition_Exclusion] = [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion]
FALSE
**** START Error Info ****
In [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], but not in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :
# Total code points: 0
Not in [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], but in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :
1B06 # Lo BALINESE LETTER AKARA TEDUNG
1B08 # Lo BALINESE LETTER IKARA TEDUNG
1B0A # Lo BALINESE LETTER UKARA TEDUNG
1B0C # Lo BALINESE LETTER RA REPA TEDUNG
1B0E # Lo BALINESE LETTER LA LENGA TEDUNG
1B12 # Lo BALINESE LETTER OKARA TEDUNG
1B3B # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
1B3D # Mc BALINESE VOWEL SIGN LA LENGA TEDUNG
1B40..1B41 # Mc [2] BALINESE VOWEL SIGN TALING TEDUNG..BALINESE VOWEL SIGN TALING REPA TEDUNG
1B43 # Mc BALINESE VOWEL SIGN PEPET TEDUNG
# Total code points: 11
In both [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], and in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :
00C0..00C5 # L& [6] LATIN CAPITAL LETTER A WITH GRAVE..LATIN CAPITAL LETTER A WITH RING ABOVE
00C7..00CF # L& [9] LATIN CAPITAL LETTER C WITH CEDILLA..LATIN CAPITAL LETTER I WITH DIAERESIS
00D1..00D6 # L& [6] LATIN CAPITAL LETTER N WITH TILDE..LATIN CAPITAL LETTER O WITH DIAERESIS
...
30F7..30FA # Lo [4] KATAKANA LETTER VA..KATAKANA LETTER VO
30FE # Lm KATAKANA VOICED ITERATION MARK
AC00..D7A3 # Lo [11172] HANGUL SYLLABLE GA..HANGUL SYLLABLE HIH
# Total code points: 12089
**** END Error Info ****
- … `{Generated}/UnicodeTestResults-security.txt` when running `TestTestUnicodeInvariants`.
- … `-DSHOW_FILES`
- … `unicode-test-results` in the “Artifacts” section.

Instructions moved to the uca tools main page.
To build all the charts, use org.unicode.text.UCA.Main, with the option:
charts
They will be built into
http://unicode.org/draft/charts/
Once UCA is released, then copy those files up to the right spots in the Unicode site: