ICU 76.1
C API: Unicode Script Information.
#include "unicode/utypes.h"
Namespaces | |
namespace | icu |
File coll.h. | |
Typedefs | |
typedef enum UScriptCode | UScriptCode |
Constants for ISO 15924 script codes. | |
typedef enum UScriptUsage | UScriptUsage |
Script usage constants. | |
Functions | |
U_CAPI int32_t | uscript_getCode (const char *nameOrAbbrOrLocale, UScriptCode *fillIn, int32_t capacity, UErrorCode *err) |
Gets the script codes associated with the given locale or ISO 15924 abbreviation or name. | |
U_CAPI const char * | uscript_getName (UScriptCode scriptCode) |
Returns the long Unicode script name, if there is one. | |
U_CAPI const char * | uscript_getShortName (UScriptCode scriptCode) |
Returns the 4-letter ISO 15924 script code, which is the same as the short Unicode script name if Unicode has names for the script. | |
U_CAPI UScriptCode | uscript_getScript (UChar32 codepoint, UErrorCode *err) |
Gets the script code associated with the given codepoint. | |
U_CAPI UBool | uscript_hasScript (UChar32 c, UScriptCode sc) |
Do the Script_Extensions of code point c contain script sc? If c does not have explicit Script_Extensions, then this tests whether c has the Script property value sc. | |
U_CAPI int32_t | uscript_getScriptExtensions (UChar32 c, UScriptCode *scripts, int32_t capacity, UErrorCode *errorCode) |
Writes code point c's Script_Extensions as a list of UScriptCode values to the output scripts array and returns the number of script codes. | |
U_CAPI int32_t | uscript_getSampleString (UScriptCode script, UChar *dest, int32_t capacity, UErrorCode *pErrorCode) |
Writes the script sample character string. | |
U_COMMON_API icu::UnicodeString | uscript_getSampleUnicodeString (UScriptCode script) |
Returns the script sample character string. | |
U_CAPI UScriptUsage | uscript_getUsage (UScriptCode script) |
Returns the script usage according to UAX #31 Unicode Identifier and Pattern Syntax. | |
U_CAPI UBool | uscript_isRightToLeft (UScriptCode script) |
Returns true if the script is written right-to-left. | |
U_CAPI UBool | uscript_breaksBetweenLetters (UScriptCode script) |
Returns true if the script allows line breaks between letters (excluding hyphenation). | |
U_CAPI UBool | uscript_isCased (UScriptCode script) |
Returns true if case distinctions are customary in modern (or most recent) usage of the script. | |
C API: Unicode Script Information.
Definition in file uscript.h.
typedef enum UScriptCode UScriptCode |
Constants for ISO 15924 script codes.
The current set of script code constants supports at least all scripts that are encoded in the version of Unicode which ICU currently supports. The names of the constants are usually derived from the Unicode script property value aliases. See UAX #24 Unicode Script Property (http://www.unicode.org/reports/tr24/) and http://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt .
In addition, constants for many ISO 15924 script codes are included, for use with language tags, CLDR data, and similar. Some of those codes are not used in the Unicode Character Database (UCD). For example, there are no characters that have a UCD script property value of Hans or Hant. All Han ideographs have the Hani script property value in Unicode.
Private-use codes Qaaa..Qabx are not included, except as used in the UCD or in CLDR.
Starting with ICU 55, script codes are only added when their scripts have been or will certainly be encoded in Unicode, and have been assigned Unicode script property value aliases, to ensure that their script names are stable and match the names of the constants. Script codes like Latf and Aran that are not subject to separate encoding may be added at any time.
typedef enum UScriptUsage UScriptUsage |
Script usage constants.
See UAX #31 Unicode Identifier and Pattern Syntax. http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Exclusion_from_Identifiers
enum UScriptCode |
Enumerator | |
---|---|
USCRIPT_INVALID_CODE |
|
USCRIPT_COMMON |
|
USCRIPT_INHERITED |
|
USCRIPT_ARABIC |
|
USCRIPT_ARMENIAN |
|
USCRIPT_BENGALI |
|
USCRIPT_BOPOMOFO |
|
USCRIPT_CHEROKEE |
|
USCRIPT_COPTIC |
|
USCRIPT_CYRILLIC |
|
USCRIPT_DESERET |
|
USCRIPT_DEVANAGARI |
|
USCRIPT_ETHIOPIC |
|
USCRIPT_GEORGIAN |
|
USCRIPT_GOTHIC |
|
USCRIPT_GREEK |
|
USCRIPT_GUJARATI |
|
USCRIPT_GURMUKHI |
|
USCRIPT_HAN |
|
USCRIPT_HANGUL |
|
USCRIPT_HEBREW |
|
USCRIPT_HIRAGANA |
|
USCRIPT_KANNADA |
|
USCRIPT_KATAKANA |
|
USCRIPT_KHMER |
|
USCRIPT_LAO |
|
USCRIPT_LATIN |
|
USCRIPT_MALAYALAM |
|
USCRIPT_MONGOLIAN |
|
USCRIPT_MYANMAR |
|
USCRIPT_OGHAM |
|
USCRIPT_OLD_ITALIC |
|
USCRIPT_ORIYA |
|
USCRIPT_RUNIC |
|
USCRIPT_SINHALA |
|
USCRIPT_SYRIAC |
|
USCRIPT_TAMIL |
|
USCRIPT_TELUGU |
|
USCRIPT_THAANA |
|
USCRIPT_THAI |
|
USCRIPT_TIBETAN |
|
USCRIPT_CANADIAN_ABORIGINAL | Canadian_Aboriginal script.
|
USCRIPT_UCAS | Canadian_Aboriginal script (alias).
|
USCRIPT_YI |
|
USCRIPT_TAGALOG |
|
USCRIPT_HANUNOO |
|
USCRIPT_BUHID |
|
USCRIPT_TAGBANWA |
|
USCRIPT_BRAILLE |
|
USCRIPT_CYPRIOT |
|
USCRIPT_LIMBU |
|
USCRIPT_LINEAR_B |
|
USCRIPT_OSMANYA |
|
USCRIPT_SHAVIAN |
|
USCRIPT_TAI_LE |
|
USCRIPT_UGARITIC |
|
USCRIPT_KATAKANA_OR_HIRAGANA | New script code in Unicode 4.0.1.
|
USCRIPT_BUGINESE |
|
USCRIPT_GLAGOLITIC |
|
USCRIPT_KHAROSHTHI |
|
USCRIPT_SYLOTI_NAGRI |
|
USCRIPT_NEW_TAI_LUE |
|
USCRIPT_TIFINAGH |
|
USCRIPT_OLD_PERSIAN |
|
USCRIPT_BALINESE |
|
USCRIPT_BATAK |
|
USCRIPT_BLISSYMBOLS |
|
USCRIPT_BRAHMI |
|
USCRIPT_CHAM |
|
USCRIPT_CIRTH |
|
USCRIPT_OLD_CHURCH_SLAVONIC_CYRILLIC |
|
USCRIPT_DEMOTIC_EGYPTIAN |
|
USCRIPT_HIERATIC_EGYPTIAN |
|
USCRIPT_EGYPTIAN_HIEROGLYPHS |
|
USCRIPT_KHUTSURI |
|
USCRIPT_SIMPLIFIED_HAN |
|
USCRIPT_TRADITIONAL_HAN |
|
USCRIPT_PAHAWH_HMONG |
|
USCRIPT_OLD_HUNGARIAN |
|
USCRIPT_HARAPPAN_INDUS |
|
USCRIPT_JAVANESE |
|
USCRIPT_KAYAH_LI |
|
USCRIPT_LATIN_FRAKTUR |
|
USCRIPT_LATIN_GAELIC |
|
USCRIPT_LEPCHA |
|
USCRIPT_LINEAR_A |
|
USCRIPT_MANDAIC |
|
USCRIPT_MANDAEAN |
|
USCRIPT_MAYAN_HIEROGLYPHS |
|
USCRIPT_MEROITIC_HIEROGLYPHS |
|
USCRIPT_MEROITIC |
|
USCRIPT_NKO |
|
USCRIPT_ORKHON |
|
USCRIPT_OLD_PERMIC |
|
USCRIPT_PHAGS_PA |
|
USCRIPT_PHOENICIAN |
|
USCRIPT_MIAO |
|
USCRIPT_PHONETIC_POLLARD |
|
USCRIPT_RONGORONGO |
|
USCRIPT_SARATI |
|
USCRIPT_ESTRANGELO_SYRIAC |
|
USCRIPT_WESTERN_SYRIAC |
|
USCRIPT_EASTERN_SYRIAC |
|
USCRIPT_TENGWAR |
|
USCRIPT_VAI |
|
USCRIPT_VISIBLE_SPEECH |
|
USCRIPT_CUNEIFORM |
|
USCRIPT_UNWRITTEN_LANGUAGES |
|
USCRIPT_UNKNOWN |
|
USCRIPT_CARIAN |
|
USCRIPT_JAPANESE |
|
USCRIPT_LANNA |
|
USCRIPT_LYCIAN |
|
USCRIPT_LYDIAN |
|
USCRIPT_OL_CHIKI |
|
USCRIPT_REJANG |
|
USCRIPT_SAURASHTRA |
|
USCRIPT_SIGN_WRITING | Sutton SignWriting.
|
USCRIPT_SUNDANESE |
|
USCRIPT_MOON |
|
USCRIPT_MEITEI_MAYEK |
|
USCRIPT_IMPERIAL_ARAMAIC |
|
USCRIPT_AVESTAN |
|
USCRIPT_CHAKMA |
|
USCRIPT_KOREAN |
|
USCRIPT_KAITHI |
|
USCRIPT_MANICHAEAN |
|
USCRIPT_INSCRIPTIONAL_PAHLAVI |
|
USCRIPT_PSALTER_PAHLAVI |
|
USCRIPT_BOOK_PAHLAVI |
|
USCRIPT_INSCRIPTIONAL_PARTHIAN |
|
USCRIPT_SAMARITAN |
|
USCRIPT_TAI_VIET |
|
USCRIPT_MATHEMATICAL_NOTATION |
|
USCRIPT_SYMBOLS |
|
USCRIPT_BAMUM |
|
USCRIPT_LISU |
|
USCRIPT_NAKHI_GEBA |
|
USCRIPT_OLD_SOUTH_ARABIAN |
|
USCRIPT_BASSA_VAH |
|
USCRIPT_DUPLOYAN |
|
USCRIPT_DUPLOYAN_SHORTAND |
|
USCRIPT_ELBASAN |
|
USCRIPT_GRANTHA |
|
USCRIPT_KPELLE |
|
USCRIPT_LOMA |
|
USCRIPT_MENDE | Mende Kikakui.
|
USCRIPT_MEROITIC_CURSIVE |
|
USCRIPT_OLD_NORTH_ARABIAN |
|
USCRIPT_NABATAEAN |
|
USCRIPT_PALMYRENE |
|
USCRIPT_KHUDAWADI |
|
USCRIPT_SINDHI |
|
USCRIPT_WARANG_CITI |
|
USCRIPT_AFAKA |
|
USCRIPT_JURCHEN |
|
USCRIPT_MRO |
|
USCRIPT_NUSHU |
|
USCRIPT_SHARADA |
|
USCRIPT_SORA_SOMPENG |
|
USCRIPT_TAKRI |
|
USCRIPT_TANGUT |
|
USCRIPT_WOLEAI |
|
USCRIPT_ANATOLIAN_HIEROGLYPHS |
|
USCRIPT_KHOJKI |
|
USCRIPT_TIRHUTA |
|
USCRIPT_CAUCASIAN_ALBANIAN |
|
USCRIPT_MAHAJANI |
|
USCRIPT_AHOM |
|
USCRIPT_HATRAN |
|
USCRIPT_MODI |
|
USCRIPT_MULTANI |
|
USCRIPT_PAU_CIN_HAU |
|
USCRIPT_SIDDHAM |
|
USCRIPT_ADLAM |
|
USCRIPT_BHAIKSUKI |
|
USCRIPT_MARCHEN |
|
USCRIPT_NEWA |
|
USCRIPT_OSAGE |
|
USCRIPT_HAN_WITH_BOPOMOFO |
|
USCRIPT_JAMO |
|
USCRIPT_SYMBOLS_EMOJI |
|
USCRIPT_MASARAM_GONDI |
|
USCRIPT_SOYOMBO |
|
USCRIPT_ZANABAZAR_SQUARE |
|
USCRIPT_DOGRA |
|
USCRIPT_GUNJALA_GONDI |
|
USCRIPT_MAKASAR |
|
USCRIPT_MEDEFAIDRIN |
|
USCRIPT_HANIFI_ROHINGYA |
|
USCRIPT_SOGDIAN |
|
USCRIPT_OLD_SOGDIAN |
|
USCRIPT_ELYMAIC |
|
USCRIPT_NYIAKENG_PUACHUE_HMONG |
|
USCRIPT_NANDINAGARI |
|
USCRIPT_WANCHO |
|
USCRIPT_CHORASMIAN |
|
USCRIPT_DIVES_AKURU |
|
USCRIPT_KHITAN_SMALL_SCRIPT |
|
USCRIPT_YEZIDI |
|
USCRIPT_CYPRO_MINOAN |
|
USCRIPT_OLD_UYGHUR |
|
USCRIPT_TANGSA |
|
USCRIPT_TOTO |
|
USCRIPT_VITHKUQI |
|
USCRIPT_KAWI |
|
USCRIPT_NAG_MUNDARI |
|
USCRIPT_ARABIC_NASTALIQ |
|
USCRIPT_GARAY |
|
USCRIPT_GURUNG_KHEMA |
|
USCRIPT_KIRAT_RAI |
|
USCRIPT_OL_ONAL |
|
USCRIPT_SUNUWAR |
|
USCRIPT_TODHRI |
|
USCRIPT_TULU_TIGALARI |
|
USCRIPT_CODE_LIMIT | One more than the highest normal UScriptCode value. The highest value is available via u_getIntPropertyMaxValue(UCHAR_SCRIPT).
|
enum UScriptUsage |
Enumerator | |
---|---|
USCRIPT_USAGE_NOT_ENCODED | Not encoded in Unicode.
|
USCRIPT_USAGE_UNKNOWN | Unknown script usage.
|
USCRIPT_USAGE_EXCLUDED | Candidate for Exclusion from Identifiers.
|
USCRIPT_USAGE_LIMITED_USE | Limited Use script.
|
USCRIPT_USAGE_ASPIRATIONAL | Aspirational Use script.
|
USCRIPT_USAGE_RECOMMENDED | Recommended script.
|
U_CAPI UBool uscript_breaksBetweenLetters | ( | UScriptCode | script | ) |
Returns true if the script allows line breaks between letters (excluding hyphenation).
Such a script typically requires dictionary-based line breaking. For example, Hani and Thai.
script | script code |
U_CAPI int32_t uscript_getCode | ( | const char * | nameOrAbbrOrLocale, |
UScriptCode * | fillIn, | ||
int32_t | capacity, | ||
UErrorCode * | err | ||
) |
Gets the script codes associated with the given locale or ISO 15924 abbreviation or name.
Fills in USCRIPT_MALAYALAM given "Malayalam" or "Mlym". Fills in USCRIPT_LATIN given "en" or "en_US". If the required capacity is greater than the capacity of the destination buffer, the error code is set to U_BUFFER_OVERFLOW_ERROR and the required capacity is returned.
Note: To search by short or long script alias only, use u_getPropertyValueEnum(UCHAR_SCRIPT, alias) instead. That does a fast lookup with no access of the locale data.
nameOrAbbrOrLocale | name of the script, as given in PropertyValueAliases.txt, or ISO 15924 code or locale |
fillIn | the UScriptCode buffer that receives the script code(s) |
capacity | the capacity (size) of UScriptCode buffer passed in. |
err | the error status code. |
U_CAPI const char * uscript_getName | ( | UScriptCode | scriptCode | ) |
Returns the long Unicode script name, if there is one.
Otherwise returns the 4-letter ISO 15924 script code. Returns "Malayalam" given USCRIPT_MALAYALAM.
scriptCode | UScriptCode enum |
U_CAPI int32_t uscript_getSampleString | ( | UScriptCode | script, |
UChar * | dest, | ||
int32_t | capacity, | ||
UErrorCode * | pErrorCode | ||
) |
Writes the script sample character string.
This string normally consists of one code point but might be longer. The string is empty if the script is not encoded.
script | script code |
dest | output string array |
capacity | number of UChars in the dest array |
pErrorCode | standard ICU in/out error code, must pass U_SUCCESS() on input |
U_COMMON_API icu::UnicodeString uscript_getSampleUnicodeString | ( | UScriptCode | script | ) |
Returns the script sample character string.
This string normally consists of one code point but might be longer. The string is empty if the script is not encoded.
script | script code |
U_CAPI UScriptCode uscript_getScript | ( | UChar32 | codepoint, |
UErrorCode * | err | ||
) |
Gets the script code associated with the given codepoint.
Returns USCRIPT_MALAYALAM given 0x0D02.
codepoint | UChar32 codepoint |
err | the error status code. |
U_CAPI int32_t uscript_getScriptExtensions | ( | UChar32 | c, |
UScriptCode * | scripts, | ||
int32_t | capacity, | ||
UErrorCode * | errorCode | ||
) |
Writes code point c's Script_Extensions as a list of UScriptCode values to the output scripts array and returns the number of script codes.
Some characters are commonly used in multiple scripts. For more information, see UAX #24: http://www.unicode.org/reports/tr24/.
If there are more than capacity script codes to be written, then U_BUFFER_OVERFLOW_ERROR is set and the number of Script_Extensions is returned. (Usual ICU buffer handling behavior.)
c | code point |
scripts | output script code array |
capacity | capacity of the scripts array |
errorCode | Standard ICU error code. Its input value must pass the U_SUCCESS() test, or else the function returns immediately. Check for U_FAILURE() on output or use with function chaining. (See User Guide for details.) |
U_CAPI const char * uscript_getShortName | ( | UScriptCode | scriptCode | ) |
Returns the 4-letter ISO 15924 script code, which is the same as the short Unicode script name if Unicode has names for the script.
Returns "Mlym" given USCRIPT_MALAYALAM.
scriptCode | UScriptCode enum |
U_CAPI UScriptUsage uscript_getUsage | ( | UScriptCode | script | ) |
Returns the script usage according to UAX #31 Unicode Identifier and Pattern Syntax.
Returns USCRIPT_USAGE_NOT_ENCODED if the script is not encoded in Unicode.
script | script code |
U_CAPI UBool uscript_hasScript | ( | UChar32 | c, |
UScriptCode | sc | ||
) |
Do the Script_Extensions of code point c contain script sc? If c does not have explicit Script_Extensions, then this tests whether c has the Script property value sc.
Some characters are commonly used in multiple scripts. For more information, see UAX #24: http://www.unicode.org/reports/tr24/.
c | code point |
sc | script code |
U_CAPI UBool uscript_isCased | ( | UScriptCode | script | ) |
Returns true if case distinctions are customary in modern (or most recent) usage of the script.
For example, Latn and Cyrl.
script | script code |
U_CAPI UBool uscript_isRightToLeft | ( | UScriptCode | script | ) |
Returns true if the script is written right-to-left.
For example, Arab and Hebr.
script | script code |