Struct icu::segmenter::GraphemeClusterSegmenter
pub struct GraphemeClusterSegmenter { /* private fields */ }
Segments a string into grapheme clusters.
Supports loading grapheme cluster break data, and creating grapheme cluster break iterators for different string encodings.
Examples
Segment a string:
use icu::segmenter::GraphemeClusterSegmenter;
let segmenter = GraphemeClusterSegmenter::new();
let breakpoints: Vec<usize> = segmenter.segment_str("Hello 🗺").collect();
// World Map (U+1F5FA) is encoded in four bytes in UTF-8.
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 10]);
Segment a Latin1 byte string:
use icu::segmenter::GraphemeClusterSegmenter;
let segmenter = GraphemeClusterSegmenter::new();
let breakpoints: Vec<usize> =
    segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]);
Successive boundaries can be used to retrieve the grapheme clusters. In particular, the first boundary is always 0, and the last one is the length of the segmented text in code units.
use icu::segmenter::GraphemeClusterSegmenter;
use itertools::Itertools;
let segmenter = GraphemeClusterSegmenter::new();
let text = "मांजर";
let grapheme_clusters: Vec<&str> = segmenter
    .segment_str(text)
    .tuple_windows()
    .map(|(i, j)| &text[i..j])
    .collect();
assert_eq!(&grapheme_clusters, &["मां", "ज", "र"]);
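The same pairing of successive boundaries can also be done with the standard library alone, without itertools. The following is a minimal sketch and not part of the icu API: `clusters_from_boundaries` is a hypothetical helper, and it assumes the boundaries were already collected from `segment_str` for the same text.

```rust
/// Hypothetical helper (not part of `icu`): slices `text` at each pair of
/// successive boundaries, e.g. the offsets collected from `segment_str`.
///
/// Assumes the boundaries are ascending byte offsets that fall on UTF-8
/// character boundaries of `text`; slicing panics otherwise.
fn clusters_from_boundaries<'a>(text: &'a str, boundaries: &[usize]) -> Vec<&'a str> {
    boundaries
        .windows(2) // yields [start, end] for each adjacent pair
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

For instance, `clusters_from_boundaries("Hello 🗺", &[0, 1, 2, 3, 4, 5, 6, 10])` yields the seven clusters `H`, `e`, `l`, `l`, `o`, a space, and `🗺`. For the empty string, where the only boundary is 0, `windows(2)` produces no pairs and the result is empty.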
This segmenter applies all rules provided to the constructor. Thus, if the data supplied by the provider comprises all grapheme cluster boundary rules from Unicode Standard Annex #29, Unicode Text Segmentation, which is the case for the default data (both test data and data produced by icu_provider_source), the segment_* functions return extended grapheme cluster boundaries, as opposed to legacy grapheme cluster boundaries. See Section 3, Grapheme Cluster Boundaries, and Table 1a, Sample Grapheme Clusters, in Unicode Standard Annex #29, Unicode Text Segmentation.
use icu::segmenter::GraphemeClusterSegmenter;
let segmenter = GraphemeClusterSegmenter::new();
// நி (TAMIL LETTER NA, TAMIL VOWEL SIGN I) is an extended grapheme cluster,
// but not a legacy grapheme cluster.
let ni = "நி";
let egc_boundaries: Vec<usize> = segmenter.segment_str(ni).collect();
assert_eq!(&egc_boundaries, &[0, ni.len()]);
Implementations
impl GraphemeClusterSegmenter
pub fn new() -> GraphemeClusterSegmenter
Constructs a GraphemeClusterSegmenter with an invariant locale from compiled data.
✨ Enabled with the compiled_data Cargo feature.
pub fn try_new_with_any_provider(
    provider: &(impl AnyProvider + ?Sized),
) -> Result<GraphemeClusterSegmenter, DataError>
A version of Self::new that uses custom data provided by an AnyProvider.
pub fn try_new_with_buffer_provider(
    provider: &(impl BufferProvider + ?Sized),
) -> Result<GraphemeClusterSegmenter, DataError>
A version of Self::new that uses custom data provided by a BufferProvider.
✨ Enabled with the serde feature.
pub fn try_new_unstable<D>(
    provider: &D,
) -> Result<GraphemeClusterSegmenter, DataError>
A version of Self::new that uses custom data provided by a DataProvider.
pub fn segment_str<'l, 's>(
    &'l self,
    input: &'s str,
) -> GraphemeClusterBreakIterator<'l, 's, RuleBreakTypeUtf8>
Creates a grapheme cluster break iterator for a str (a UTF-8 string).
pub fn segment_utf8<'l, 's>(
    &'l self,
    input: &'s [u8],
) -> GraphemeClusterBreakIterator<'l, 's, RuleBreakTypePotentiallyIllFormedUtf8>
Creates a grapheme cluster break iterator for a potentially ill-formed UTF-8 string.
Invalid characters are treated as REPLACEMENT CHARACTER (U+FFFD).
There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
pub fn segment_latin1<'l, 's>(
    &'l self,
    input: &'s [u8],
) -> GraphemeClusterBreakIterator<'l, 's, RuleBreakTypeLatin1>
Creates a grapheme cluster break iterator for a Latin-1 (8-bit) string.
There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
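When the text starts out as a Rust &str rather than a byte literal, it has to be converted to Latin-1 bytes before it can be passed here. Each character in the range U+0000 to U+00FF corresponds to the single Latin-1 byte with the same value. A minimal sketch using only the standard library; `to_latin1` is a hypothetical helper and not part of icu.

```rust
/// Hypothetical helper (not part of `icu`): encodes `text` as Latin-1,
/// one byte per character. Returns `None` if any character lies outside
/// U+0000..=U+00FF and therefore has no Latin-1 representation.
fn to_latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                Some(code_point as u8) // lossless: the value fits in one byte
            } else {
                None
            }
        })
        .collect() // collecting into Option stops at the first `None`
}
```

The resulting bytes can then be passed as a slice to segment_latin1. Note that a precomposed é (U+00E9) is the single byte 0xE9 in Latin-1 but two bytes in UTF-8, so boundary offsets for the same text can differ between segment_latin1 and segment_str.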
pub fn segment_utf16<'l, 's>(
    &'l self,
    input: &'s [u16],
) -> GraphemeClusterBreakIterator<'l, 's, RuleBreakTypeUtf16>
Creates a grapheme cluster break iterator for a UTF-16 string.
There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
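Boundaries for UTF-16 input count u16 code units, not bytes or characters. The following minimal sketch, using only the standard library, prepares UTF-16 input from a &str and turns code-unit boundaries back into strings. Both helpers are hypothetical and not part of icu, and the boundaries in the usage note are written by hand on the assumption that World Map (U+1F5FA), a surrogate pair in UTF-16, forms a single cluster.

```rust
/// Hypothetical helper (not part of `icu`): encodes `text` as UTF-16 code
/// units, the input form that `segment_utf16` takes.
fn to_utf16(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}

/// Hypothetical helper (not part of `icu`): decodes the code units between
/// each pair of successive boundaries. Unpaired surrogates become U+FFFD.
///
/// Assumes the boundaries are ascending and no larger than `units.len()`;
/// slicing panics otherwise.
fn utf16_clusters(units: &[u16], boundaries: &[usize]) -> Vec<String> {
    boundaries
        .windows(2) // yields [start, end] for each adjacent pair
        .map(|pair| String::from_utf16_lossy(&units[pair[0]..pair[1]]))
        .collect()
}
```

For "Hello 🗺", `to_utf16` produces 8 code units (six for "Hello " and two for the surrogate pair), so with the assumed boundaries `[0, 1, 2, 3, 4, 5, 6, 8]` the last cluster spans code units 6 to 8. The same text has a final boundary of 10 when segmented as UTF-8 with segment_str.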
Trait Implementations
impl Debug for GraphemeClusterSegmenter
impl Default for GraphemeClusterSegmenter
fn default() -> GraphemeClusterSegmenter
Auto Trait Implementations
impl Freeze for GraphemeClusterSegmenter
impl RefUnwindSafe for GraphemeClusterSegmenter
impl Send for GraphemeClusterSegmenter
impl Sync for GraphemeClusterSegmenter
impl Unpin for GraphemeClusterSegmenter
impl UnwindSafe for GraphemeClusterSegmenter
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.