UnicodeSets use regular-expression syntax to allow for arbitrary set operations
(Union, Intersection, Difference) on sets of Unicode characters. The base sets
can be specified explicitly, such as [a-m w-z], or using a combination of
Unicode Properties such as the following, for the Arabic script characters
that have a canonical decomposition:
[[:script=arabic:]&[:decompositiontype=canonical:]]
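The three set operations correspond to ordinary set algebra. The following is a minimal Python sketch, not the demo's implementation: it builds the explicit set [a-m w-z] from code point ranges and combines it with a second set using Python's built-in set operators. The helper name `char_range` and the `vowels` set are hypothetical and chosen only for illustration.

```python
def char_range(first: str, last: str) -> set[str]:
    """Return the set of characters from first to last, inclusive."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}

# [a-m w-z]: the union of two explicit ranges.
base = char_range("a", "m") | char_range("w", "z")
vowels = set("aeiou")  # illustrative second set

union = base | vowels          # characters in either set
intersection = base & vowels   # characters in both sets
difference = base - vowels     # characters in base but not in vowels

print(sorted(intersection))    # ['a', 'e', 'i']
```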
Enter a UnicodeSet into the Input box, and hit Show Set. You can also choose certain combinations of options for display, such as abbreviated or not.
The values you use are encapsulated into a URL for reference, such as
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=\p{sc:Greek}
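Such a URL can also be assembled programmatically. The sketch below assumes only what the links in this document show: the page address and the query parameter `a`. The function name `unicodeset_url` is hypothetical, and the choice of which characters to leave unescaped is an assumption modeled on those links rather than a documented requirement.

```python
from urllib.parse import quote

BASE_URL = "https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp"

def unicodeset_url(expression: str) -> str:
    """Build a link to the list-unicodeset page for a UnicodeSet expression."""
    # Assumption: keep ':', '/' and '=' literal, as several links in this
    # document do; other reserved characters are percent-encoded.
    return f"{BASE_URL}?a={quote(expression, safe=':/=')}"

print(unicodeset_url(r"\p{sc:Greek}"))
```

Because `:` is left unescaped, the `\p{sc:Greek}` form stays readable in the resulting link, while the backslash and braces are percent-encoded.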
If you add properties to the Group By box, you can sort the results by
property values. For example, if you set it to General_Category Numeric_Value
(or the short form gc nv), you’ll see the results sorted first by the general
category of the characters, and then by the numeric value.
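The two-level ordering can be pictured as a sort with a compound key. This is a rough Python sketch using the standard `unicodedata` module, not the demo's code. The function names are hypothetical. The ordering of category values (alphabetical by short alias) and the placement of characters without a numeric value (last within their category) are assumptions that may differ from what the page displays.

```python
import math
import unicodedata

def group_key(ch: str) -> tuple[str, float]:
    """Sort key: General_Category first, then Numeric_Value."""
    gc = unicodedata.category(ch)           # short gc value, e.g. 'Nd'
    nv = unicodedata.numeric(ch, math.inf)  # no numeric value: sort last
    return (gc, nv)

def group_by_gc_nv(chars: str) -> list[str]:
    """Order characters by general category, then by numeric value."""
    return sorted(chars, key=group_key)

print(group_by_gc_nv("a52B½"))  # ['a', 'B', '2', '5', '½']
```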
UnicodeSets are defined according to the description in UTS #35: Locale Data Markup Language (LDML), with some useful extensions in these online demos.
Properties can be specified either with Perl-style notation (\p{script=arabic}) or with POSIX-style notation ([:script=arabic:]).
Properties and values can either use a long form (like script) or a short form (like sc).
No argument is equivalent to “Yes”; this is mostly useful with binary properties, like \p{isLowercase}.
The following examples illustrate the syntax with a particular property-value
pair: the property age and the value 3.2:
The : can be used in the place of =. (Mostly because : doesn’t require
percent-encoding in URLs.)
\p{age:3.2} and [:age:3.2:]
The Perl and POSIX syntaxes for negation are \P{...} and [:^...:],
respectively. The characters ≠ and ! are added for convenience:
\p{age≠3.2} and [:age≠3.2:]
\p{age!=3.2} and [:age!=3.2:]
\p{age!:3.2} and [:age!=3.2:]
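Since the Perl-style and POSIX-style notations carry the same information, one can be rewritten into the other mechanically. The following Python sketch covers only the plain `\p{...}` and `\P{...}` forms with no nested braces. The function name `perl_to_posix` is hypothetical, and this is an illustration of the correspondence rather than a parser for the full UnicodeSet syntax.

```python
import re

# Matches \p{...} or \P{...} with no nested braces inside.
_PERL_PROPERTY = re.compile(r"\\([pP])\{([^{}]*)\}")

def perl_to_posix(expression: str) -> str:
    """Rewrite each Perl-style term as the POSIX-style equivalent."""
    def rewrite(match: re.Match) -> str:
        # A capital P marks negation, written as a leading ^ in POSIX style.
        negation = "^" if match.group(1) == "P" else ""
        return f"[:{negation}{match.group(2)}:]"
    return _PERL_PROPERTY.sub(rewrite, expression)

print(perl_to_posix(r"\p{age:3.2}"))  # [:age:3.2:]
print(perl_to_posix(r"\P{age=3.2}"))  # [:^age=3.2:]
```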
For the name property, regular expressions can be used for the value, enclosed
in /.../. For example, in the following expression, the first term will select
all those Unicode characters whose names contain “CJK”. The rest of the
expression will then subtract the ideographic characters, showing that these can
be used in arbitrary combinations.
[[[:name=/CJK/:]-[:ideographic:]]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B:name=/CJK/:%5D-%5B:ideographic:%5D%5D)
[[:name=/\bDOT$/:]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:name=/%5CbDOT$/:%5D)
[[:block=/(?i)arab/:]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:block=/(?i)arab/:%5D)
[[:toNFKC=/\./:]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:toNFKC=/%5C./:%5D)
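The regex-valued forms above behave like a search within each character's property value. A rough way to experiment with patterns offline is sketched below in Python. It is an approximation under stated assumptions: it uses Python's `re` engine and the Unicode version bundled with the interpreter, both of which may differ from what the online demo uses, and the helper name `names_matching` is hypothetical.

```python
import re
import unicodedata

def names_matching(pattern: str, start: int = 0, end: int = 0x3000):
    """Return (code point, name) pairs whose name contains a regex match."""
    regex = re.compile(pattern)
    hits = []
    for cp in range(start, end):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if name and regex.search(name):
            hits.append((cp, name))
    return hits

# Names that end with the whole word DOT, within the first 256 code points.
for cp, name in names_matching(r"\bDOT$", end=0x100):
    print(f"U+{cp:04X} {name}")
```

With this helper, `names_matching(r"^DOT\b")` finds names that start with the word DOT, and a leading `(?i)` makes the match case-insensitive.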
Some particularly useful regex features are:
\b means a word break, ^ means front of the string, and $ means end. So /^DOT\b/ means the word DOT at the start.
(?i) means case-insensitive matching.
Caveats:
… /.../ pattern.
… [:...:] syntax on the outside, such as: [:Block=/Aegean_Numbers/:] returns a different number of characters than [:Block=Aegean_Numbers:], because it skips Unassigned code points.
[:Block=aegeannumbers:] works, but [:Block=/aegeannumbers/:] fails; you have to use [:Block=/Aegean_Numbers/:] or [:Block=/(?i)aegean_numbers/:].
Property values can be compared to those for other properties, using the syntax @...@. For example:
\p{idna2003!=@uts46@}
\p{idna2003!=@uts46@}&\p{age=3.2}
There is a special property “cp” that returns the code point itself. For example:
\p{toLowercase!=@cp@}
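Conceptually, the @...@ form selects the code points for which two property functions return different (or equal) values. The sketch below illustrates that idea in Python for the `\p{toLowercase!=@cp@}` example. All names are hypothetical, Python's `str.lower` stands in for toLowercase, and the scan is limited to a small range to keep it quick.

```python
from typing import Callable

def where_properties_differ(
    prop_a: Callable[[str], str],
    prop_b: Callable[[str], str],
    start: int = 0,
    end: int = 0x250,
) -> list[int]:
    """Code points in [start, end) whose two property values differ."""
    return [c for c in range(start, end) if prop_a(chr(c)) != prop_b(chr(c))]

def to_lowercase(ch: str) -> str:
    # Assumption: Python's lowercasing approximates the toLowercase property.
    return ch.lower()

def cp(ch: str) -> str:
    # The special "cp" property: the code point itself.
    return ch

# Rough analogue of \p{toLowercase!=@cp@} over a small range.
changed = where_properties_differ(to_lowercase, cp)
print([chr(c) for c in changed[:5]])  # ['A', 'B', 'C', 'D', 'E']
```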
You can see a full listing of the possible properties on
https://util.unicode.org/UnicodeJsps/properties.jsp. The standard Unicode
properties are supported, plus the extra ICU properties. There are some
additional properties just in this demo. The easiest way to see the properties
for a range of characters is to use a set like [:Greek:] in the Input, and
then set the Group By box to the property name.
Normally, \p{isX} is equivalent to \p{toX=@cp@}. There are some exceptions and
missing cases.
Note: The Unassigned, Surrogate, and Private Use code points are skipped in the generation of some of these sets.
The following provides details for some cases.
Unicode defines a number of string casing functions in Section 3.13 Default
Case Algorithms. These string functions can also be applied to single
characters.
Warning: the first three sets may be somewhat misleading:
isLowercase means that the character is the same as its lowercase version, which
includes all uncased characters. To get those characters that are cased
characters and lowercase, use
[[:isLowercase:]&[:isCased:]]
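The distinction in the warning can be checked on a few characters. The Python sketch below is only an illustration: `is_lowercase` follows the definition given above (unchanged by lowercasing), while `is_cased_approx` is a hypothetical stand-in that assumes Python's `islower`, `isupper`, and `istitle` methods approximate the Cased property.

```python
def is_lowercase(ch: str) -> bool:
    """isLowercase in the sense above: unchanged by lowercasing."""
    return ch.lower() == ch

def is_cased_approx(ch: str) -> bool:
    # Assumption: these str methods approximate the Cased property.
    return ch.islower() or ch.isupper() or ch.istitle()

# Columns: character, isLowercase, isLowercase and isCased.
for ch in "a1A":
    print(repr(ch), is_lowercase(ch), is_lowercase(ch) and is_cased_approx(ch))
```

The digit `1` passes the first test because lowercasing leaves it unchanged, but drops out once the cased condition is added.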
[:toLowercase=a:]
[:toCaseFold=a:]
[:toUppercase=A:]
[:toTitlecase=A:]
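Each of these sets collects the characters that a casing function maps to the given value. A rough Python analogue is sketched below. It assumes that `str.upper` and `str.lower` approximate toUppercase and toLowercase, scans only the Basic Multilingual Plane, and skips Unassigned, Surrogate, and Private Use code points by general category. The helper name `chars_mapping_to` is hypothetical.

```python
import unicodedata

SKIPPED = {"Cn", "Cs", "Co"}  # Unassigned, Surrogate, Private Use

def chars_mapping_to(func, target: str, end: int = 0x10000) -> list[str]:
    """Characters below `end` for which func(ch) == target."""
    result = []
    for cp in range(end):
        ch = chr(cp)
        if unicodedata.category(ch) in SKIPPED:
            continue
        if func(ch) == target:
            result.append(ch)
    return result

print(chars_mapping_to(str.upper, "A"))  # analogue of [:toUppercase=A:]
print(chars_mapping_to(str.lower, "a"))  # analogue of [:toLowercase=a:]
```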
Note: The Unassigned, Surrogate, and Private Use code points are skipped in generation of the sets.
Unicode defines a number of string normalization functions in UAX #15. These string functions can also be applied to single characters.
[:toNFC=a:]
[:toNFD=A\u0300:]
[:toNFKC=A:]
[:toNFKD=A\u0300:]
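The normalization sets work the same way: each one collects the characters whose normalized form equals the given string. The Python sketch below uses `unicodedata.normalize` from the standard library and scans only the Basic Multilingual Plane. The helper name `chars_normalizing_to` is hypothetical, and results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def chars_normalizing_to(form: str, target: str, end: int = 0x10000) -> list[str]:
    """Characters below `end` whose normalization in `form` equals target."""
    return [
        chr(cp)
        for cp in range(end)
        if unicodedata.normalize(form, chr(cp)) == target
    ]

nfd_hits = chars_normalizing_to("NFD", "A\u0300")  # analogue of [:toNFD=A\u0300:]
nfkc_hits = chars_normalizing_to("NFKC", "A")      # analogue of [:toNFKC=A:]
print([f"U+{ord(ch):04X}" for ch in nfd_hits])
```

For instance, U+00C0 appears in the NFD result because it decomposes to A followed by U+0300, and the fullwidth letter U+FF21 appears in the NFKC result alongside A itself.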