Inflection
Morphology Inflection
Unicode Inflection is a C/C++ library that provides support for the following tasks.
It uses C++20, ICU4C, UTF-16 strings (just like Java) and a data source of lexical dictionaries that contain relationships between inflections of a word. Just like ICU, service objects are thread safe, but mutable objects are not necessarily safe to share between threads.
Making this implementation open source lets various software frameworks generate grammatically correct messages and lowers the barriers to correctly localizing software.
Unicode Inflection is currently supported on these operating systems:
The following sections delve a bit deeper into the low-level functionality of Unicode Inflection, such as how caching and multi-threading work. These sections are meant as a guide to using Unicode Inflection safely while also getting the most out of the library.
At the time of writing, caching is a one-way street. Once an object that uses caching functionality has been populated with some data, that data remains in memory until the process has terminated. Reloading such caches is not supported, since that would involve ensuring that all dependencies in the process space sharing the same resources have also stopped and released those resources.
The caching done by Unicode Inflection lowers the lookup time for many inflection::dialog::CommonConceptFactory operations. For this reason, it may be a good idea to initialize these constructs before lookup time, so that Unicode Inflection is in a "warmed up" state.
It is important to note that many of these cached data structures have ties to specific references in Unicode Inflection's memory-mapped dictionaries. This makes reloading dictionaries difficult.
Grammar synthesizers memory-map lexical dictionaries and cache various grammatical structures depending on the language. Synthesized words are not cached.
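The one-way caching and "warm up" behavior described above can be illustrated with a minimal, self-contained sketch. It does not use the Unicode Inflection API: `LanguageData`, `OneWayCache` and `warmUp` are hypothetical names invented for this example, and the real library's internals may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for an expensive, language-specific structure.
struct LanguageData {
    std::string language;
};

// Hypothetical one-way cache: entries are built on first use and are never
// evicted or reloaded, so they stay in memory until the process exits.
class OneWayCache {
public:
    // Returns the cached entry for `language`, building it on the first request.
    std::shared_ptr<const LanguageData> get(const std::string& language) {
        assert(!language.empty());
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = entries_.find(language);
        if (it == entries_.end()) {
            ++loadCount_;  // counts the expensive loads, for illustration only
            it = entries_.emplace(
                language,
                std::make_shared<const LanguageData>(LanguageData{language})).first;
        }
        return it->second;
    }

    // Number of expensive loads performed so far.
    std::size_t loadCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return loadCount_;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, std::shared_ptr<const LanguageData>> entries_;
    std::size_t loadCount_ = 0;
};

// Warm-up: touch every language the application expects to serve, so that
// later lookups hit the cache instead of paying the load cost.
inline void warmUp(OneWayCache& cache, const std::vector<std::string>& languages) {
    for (const auto& language : languages) {
        cache.get(language);
    }
}
```

Calling `warmUp(cache, {"en", "fr"})` at startup moves the load cost out of the request path. The trade-off is the one stated above: memory is held for the lifetime of the process.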
Unicode Inflection is multi-thread friendly. It uses std::mutex in places where deadlocks could occur, and generally tries to abstract this away from users.
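As a rough illustration of this threading model, the sketch below shares one read-only service object between threads and guards the only shared mutable state with a `std::mutex`. `ToyService`, `ResultLog` and `processInParallel` are hypothetical names for this example and are not part of Unicode Inflection.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical immutable service object: it has only const member functions,
// so it can be shared between threads once constructed.
class ToyService {
public:
    std::u16string appendSuffix(const std::u16string& word) const {
        return word + u"s";  // toy transformation, not real inflection
    }
};

// Shared mutable state: every access goes through the mutex.
class ResultLog {
public:
    void add(std::u16string value) {
        std::lock_guard<std::mutex> lock(mutex_);
        values_.push_back(std::move(value));
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return values_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::u16string> values_;
};

// Runs one worker thread per word and returns how many results were logged.
inline std::size_t processInParallel(const std::vector<std::u16string>& words) {
    const ToyService service{};  // shared, read-only
    ResultLog log;               // shared, mutex-protected
    std::vector<std::thread> workers;
    for (const auto& word : words) {
        workers.emplace_back([&service, &log, word] {
            // `local` is a mutable object owned by this thread only.
            std::u16string local = service.appendSuffix(word);
            log.add(std::move(local));
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return log.size();
}
```

The design choice mirrors the guidance given earlier: share service objects freely, keep mutable objects local to one thread, and lock only where state is truly shared.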
This project was donated to the Unicode Consortium by Siri at Apple Inc. These additional resources may be helpful background information:
The following are the dependencies to use this code.
Library | runtime | build time | test time | Note |
---|---|---|---|---|
ICU4C | ✅ | ✅ | ✅ | |
marisa | ✅ | ✅ | ✅ | statically linked |
cmake | | ✅ | | |
libxml2 | | ✅ | ✅ | |
Catch2 | | | ✅ | automatically downloaded |
Additional checkout steps are necessary when working with the repository as it utilizes Git LFS files.
For more details and troubleshooting refer to this guide.
Before building this project, you must have a distribution of ICU4C available. The path to the ICU distribution must be set as ICU_ROOT in either options.mk or as a command line argument to cmake. The path should be the same as the --prefix value used when ICU was configured, built and installed.
Typical ICU installation requires:
Alternatively, you can use cmake for building, testing, and so on.
Optionally, ICU_ROOT can be specified in the file options.mk with the following type of syntax.