Function icu::experimental::unicodeset_parse::parse

pub fn parse(
    source: &str,
) -> Result<(CodePointInversionListAndStringList<'static>, usize), ParseError>

Parses a UnicodeSet pattern and returns a UnicodeSet in the form of a CodePointInversionListAndStringList, as well as the number of bytes consumed from the source string.

Supports UnicodeSets as described in UTS #35 - Unicode Sets.

The error type of the returned Result can be pretty-printed with ParseError::fmt_with_source.

§Variables

If you need support for variables inside UnicodeSets (e.g., [$start-$end]), use parse_with_variables.

§Limitations

  • Only the ECMA-262 properties are currently supported, and property names must match the exact spelling listed in ECMA-262. The UTS #35 syntax for elided General_Category and Script property names is supported: [:Latn:] and [:Ll:] are both valid, with the former implying the Script property and the latter the General_Category property.
  • \N{Unicode code point name} character escaping is not supported. Use any other escape method described in UTS #35.

Enabled with the compiled_data Cargo feature.


§Examples

Parse ranges

use icu::experimental::unicodeset_parse::parse;

let source = "[a-zA-Z0-9]";
let (set, consumed) = parse(source).unwrap();
let code_points = set.code_points();

assert!(code_points.contains_range('a'..='z'));
assert!(code_points.contains_range('A'..='Z'));
assert!(code_points.contains_range('0'..='9'));
assert_eq!(consumed, source.len());
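
The code point part of the result is an inversion list. As a rough illustration of that idea, here is a minimal std-only sketch. ToyInversionList and alnum are hypothetical names invented for this example; this is a toy model of the general technique, not the ICU4X CodePointInversionList type or its internals.

```rust
/// Illustrative inversion list: a sorted list of boundaries where each pair
/// `[b[2i], b[2i + 1])` is a half-open range of code points in the set.
/// This is a toy model, not the ICU4X `CodePointInversionList` type.
struct ToyInversionList {
    boundaries: Vec<u32>,
}

impl ToyInversionList {
    /// Builds the list from inclusive ranges that are assumed to be sorted
    /// and non-overlapping (the sketch does not merge or validate them).
    fn from_sorted_ranges(ranges: &[(char, char)]) -> Self {
        let mut boundaries = Vec::with_capacity(ranges.len() * 2);
        for &(start, end) in ranges {
            boundaries.push(start as u32);
            boundaries.push(end as u32 + 1); // exclusive upper bound
        }
        Self { boundaries }
    }

    /// A code point is in the set when an odd number of boundaries are <= it.
    fn contains(&self, c: char) -> bool {
        let crossed = self.boundaries.partition_point(|&b| b <= c as u32);
        crossed % 2 == 1
    }
}

/// Hypothetical helper: the toy equivalent of `[a-zA-Z0-9]`.
fn alnum() -> ToyInversionList {
    ToyInversionList::from_sorted_ranges(&[('0', '9'), ('A', 'Z'), ('a', 'z')])
}
```

With this model, three ranges are stored as six boundaries, and a membership test is a single binary search regardless of how many code points each range spans.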

Parse properties, set operations, inner sets

use icu::experimental::unicodeset_parse::parse;

let (set, _) =
    parse("[[:^ll:]-[^][:gc = Lowercase Letter:]&[^[[^]-[a-z]]]]").unwrap();
assert!(set.code_points().contains_range('a'..='z'));
assert_eq!(('a'..='z').count(), set.size());

Inversions remove strings

use icu::experimental::unicodeset_parse::parse;

let (set, _) =
    parse(r"[[a-z{hello\ world}]&[^a-y{hello\ world}]]").unwrap();
assert!(set.contains('z'));
assert_eq!(set.size(), 1);
assert!(!set.has_strings());
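
To see why the result above contains only 'z' and no strings, the following std-only sketch models a set as characters plus strings. ToySet and example are hypothetical names, and the universe is assumed to be just 'a'..='z' so the complement stays small; the real parser works over all code points.

```rust
use std::collections::BTreeSet;

/// Toy model of a set holding single characters plus multi-character strings.
/// The universe of characters is restricted to 'a'..='z' to keep the
/// complement small; real UnicodeSets range over all code points.
#[derive(Clone, Debug, PartialEq)]
struct ToySet {
    chars: BTreeSet<char>,
    strings: BTreeSet<String>,
}

impl ToySet {
    fn new(chars: impl IntoIterator<Item = char>, strings: &[&str]) -> Self {
        Self {
            chars: chars.into_iter().collect(),
            strings: strings.iter().map(|s| s.to_string()).collect(),
        }
    }

    /// Complement: flips character membership and drops every string.
    fn complement(&self) -> Self {
        Self {
            chars: ('a'..='z').filter(|c| !self.chars.contains(c)).collect(),
            strings: BTreeSet::new(),
        }
    }

    /// Intersection of both the character part and the string part.
    fn intersect(&self, other: &Self) -> Self {
        Self {
            chars: self.chars.intersection(&other.chars).copied().collect(),
            strings: self.strings.intersection(&other.strings).cloned().collect(),
        }
    }
}

/// Toy evaluation of `[[a-z{hello\ world}]&[^a-y{hello\ world}]]`.
fn example() -> ToySet {
    let left = ToySet::new('a'..='z', &["hello world"]);
    let right = ToySet::new('a'..='y', &["hello world"]).complement();
    left.intersect(&right)
}
```

Because the complemented operand no longer carries "hello world", the intersection has no string in common, which mirrors the !set.has_strings() assertion.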

Set operators (including the implicit union) have the same precedence and are left-associative

use icu::experimental::unicodeset_parse::parse;

let (set, _) = parse("[[ace][bdf] - [abc][def]]").unwrap();
assert!(set.code_points().contains_range('d'..='f'));
assert_eq!(set.size(), ('d'..='f').count());
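
The left-to-right grouping can be checked by hand with a small std-only sketch. All names here (Op, chars, eval_left_assoc and the two example functions) are hypothetical and model only the evaluation order, not the parser itself.

```rust
use std::collections::BTreeSet;

type CharSet = BTreeSet<char>;

/// The two operators that appear in the example pattern.
enum Op {
    Union,
    Difference,
}

fn chars(s: &str) -> CharSet {
    s.chars().collect()
}

/// Folds operands left to right, giving every operator the same precedence.
fn eval_left_assoc(first: CharSet, rest: Vec<(Op, CharSet)>) -> CharSet {
    rest.into_iter().fold(first, |acc, (op, rhs)| match op {
        Op::Union => acc.union(&rhs).copied().collect(),
        Op::Difference => acc.difference(&rhs).copied().collect(),
    })
}

/// `[[ace][bdf] - [abc][def]]` read as `(([ace] ∪ [bdf]) − [abc]) ∪ [def]`.
fn left_assoc_example() -> CharSet {
    eval_left_assoc(
        chars("ace"),
        vec![
            (Op::Union, chars("bdf")),
            (Op::Difference, chars("abc")),
            (Op::Union, chars("def")),
        ],
    )
}

/// For contrast only: the result if both unions were grouped before `-`,
/// i.e. `([ace] ∪ [bdf]) − ([abc] ∪ [def])`. The pattern is NOT parsed this way.
fn union_first_example() -> CharSet {
    let left: CharSet = chars("ace").union(&chars("bdf")).copied().collect();
    let right: CharSet = chars("abc").union(&chars("def")).copied().collect();
    left.difference(&right).copied().collect()
}
```

The left-associative reading yields {d, e, f}, matching the assertions above, while the alternative grouping would yield the empty set.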

Supports partial parses

use icu::experimental::unicodeset_parse::parse;

let (set, consumed) = parse("[a-c][x-z]").unwrap();
let code_points = set.code_points();
assert!(code_points.contains_range('a'..='c'));
assert!(!code_points.contains_range('x'..='z'));
assert_eq!(set.size(), ('a'..='c').count());
// only the first UnicodeSet is parsed
assert_eq!(consumed, "[a-c]".len());
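
The consumed byte count makes it possible to continue from where the previous parse stopped. A minimal sketch of that pattern, written against any parser with the same return shape: parse_all and toy_bracket_parser are hypothetical helpers, and the toy parser is only a stand-in so the sketch runs without ICU4X. Since parse has the shape fn(&str) -> Result<(T, usize), E>, it should be possible to pass it in place of the toy parser, though how it treats characters between sets (such as whitespace) is not covered here.

```rust
/// Repeatedly applies `parse_one` to the unparsed remainder of `source`,
/// advancing by the number of bytes each call reports as consumed.
/// Assumes the reported count always lands on a UTF-8 character boundary.
fn parse_all<T, E>(
    source: &str,
    parse_one: impl Fn(&str) -> Result<(T, usize), E>,
) -> Result<Vec<T>, E> {
    let mut results = Vec::new();
    let mut offset = 0;
    while offset < source.len() {
        let (value, consumed) = parse_one(&source[offset..])?;
        results.push(value);
        if consumed == 0 {
            break; // guard against a parser that makes no progress
        }
        offset += consumed;
    }
    Ok(results)
}

/// Hypothetical stand-in parser: consumes up to and including the first `]`
/// and returns that slice as an owned string.
fn toy_bracket_parser(s: &str) -> Result<(String, usize), String> {
    match s.find(']') {
        Some(i) => Ok((s[..=i].to_string(), i + 1)),
        None => Err(format!("unterminated set: {}", s)),
    }
}
```

Calling parse_all("[a-c][x-z]", toy_bracket_parser) returns both bracketed pieces in order, and an error from any step is propagated to the caller unchanged.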