Function icu::experimental::unicodeset_parse::parse
pub fn parse(
    source: &str,
) -> Result<(CodePointInversionListAndStringList<'static>, usize), ParseError>
Parses a UnicodeSet pattern and returns a UnicodeSet in the form of a CodePointInversionListAndStringList, as well as the number of bytes consumed from the source string.
Supports UnicodeSets as described in UTS #35 - Unicode Sets.
The error type of the returned Result can be pretty-printed with ParseError::fmt_with_source.
§Variables
If you need support for variables inside UnicodeSets (e.g., [$start-$end]), use parse_with_variables.
§Limitations
- Currently, we only support the ECMA-262 properties. The property names must match the exact spelling listed in ECMA-262. Note that we do support UTS35 syntax for elided General_Category and Script property names, i.e., [:Latn:] and [:Ll:] are both valid, with the former implying the Script property, and the latter the General_Category property.
- We do not support \N{Unicode code point name} character escaping. Use any other escape method described in UTS35.
✨ Enabled with the compiled_data Cargo feature.
§Examples
Parse ranges
use icu::experimental::unicodeset_parse::parse;
let source = "[a-zA-Z0-9]";
let (set, consumed) = parse(source).unwrap();
let code_points = set.code_points();
assert!(code_points.contains_range('a'..='z'));
assert!(code_points.contains_range('A'..='Z'));
assert!(code_points.contains_range('0'..='9'));
assert_eq!(consumed, source.len());
Parse properties, set operations, inner sets
use icu::experimental::unicodeset_parse::parse;
let (set, _) =
    parse("[[:^ll:]-[^][:gc = Lowercase Letter:]&[^[[^]-[a-z]]]]").unwrap();
assert!(set.code_points().contains_range('a'..='z'));
assert_eq!(('a'..='z').count(), set.size());
Inversions remove strings
use icu::experimental::unicodeset_parse::parse;
let (set, _) =
    parse(r"[[a-z{hello\ world}]&[^a-y{hello\ world}]]").unwrap();
assert!(set.contains('z'));
assert_eq!(set.size(), 1);
assert!(!set.has_strings());
Set operators (including the implicit union) have the same precedence and are left-associative
use icu::experimental::unicodeset_parse::parse;
let (set, _) = parse("[[ace][bdf] - [abc][def]]").unwrap();
assert!(set.code_points().contains_range('d'..='f'));
assert_eq!(set.size(), ('d'..='f').count());
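To see why the example above yields exactly d through f, the evaluation order can be traced by hand. The following is a minimal sketch that uses only std::collections::BTreeSet, not the ICU4X API; the helper names chars, left_associative, and difference_of_unions are hypothetical and exist only for this illustration. It compares the left-associative reading described above with the reading one might expect if the implicit unions bound more tightly than the difference.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: collect the characters of `s` into a set.
fn chars(s: &str) -> BTreeSet<char> {
    s.chars().collect()
}

/// Left-associative reading of "[[ace][bdf] - [abc][def]]":
/// (([ace] ∪ [bdf]) − [abc]) ∪ [def]
fn left_associative() -> BTreeSet<char> {
    // [ace] ∪ [bdf] = {a, b, c, d, e, f}
    let union: BTreeSet<char> = chars("ace").union(&chars("bdf")).copied().collect();
    // ... − [abc] = {d, e, f}
    let difference: BTreeSet<char> = union.difference(&chars("abc")).copied().collect();
    // ... ∪ [def] = {d, e, f}
    let result: BTreeSet<char> = difference.union(&chars("def")).copied().collect();
    result
}

/// Alternative reading, for contrast only (not what the parser does):
/// ([ace] ∪ [bdf]) − ([abc] ∪ [def])
fn difference_of_unions() -> BTreeSet<char> {
    let left: BTreeSet<char> = chars("ace").union(&chars("bdf")).copied().collect();
    let right: BTreeSet<char> = chars("abc").union(&chars("def")).copied().collect();
    // Both sides are {a, ..., f}, so the difference is empty.
    let result: BTreeSet<char> = left.difference(&right).copied().collect();
    result
}
```

Under the left-associative reading the result is {d, e, f}, matching the assertions in the example. Under the alternative grouping it would be empty.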
Supports partial parses
use icu::experimental::unicodeset_parse::parse;
let (set, consumed) = parse("[a-c][x-z]").unwrap();
let code_points = set.code_points();
assert!(code_points.contains_range('a'..='c'));
assert!(!code_points.contains_range('x'..='z'));
assert_eq!(set.size(), ('a'..='c').count());
// only the first UnicodeSet is parsed
assert_eq!(consumed, "[a-c]".len());