Crate segmenter

Segment strings by lines, graphemes, words, and sentences.

This module is published as its own crate (icu_segmenter) and as part of the icu crate. See the latter for more details on the ICU4X project.

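This module contains segmenter implementations for line, grapheme cluster, word, and sentence breaking, following Unicode Standard Annex #14 (line breaking) and Unicode Standard Annex #29 (text segmentation).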

§Examples

§Line Break

Find line break opportunities:

 use icu::segmenter::LineSegmenter;

 let segmenter = LineSegmenter::new_auto(Default::default());

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(&breakpoints, &[0, 6, 13, 17, 23, 29, 36]);

See LineSegmenter for more examples.
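The breakpoints are UTF-8 byte offsets into the input, so each adjacent pair delimits one segment. The following is a minimal sketch of turning such a list into string slices. It uses only the standard library, the helper name `split_at_breakpoints` is hypothetical (not part of this crate), and the offsets are copied from the assertion above rather than recomputed.

```rust
/// Splits `text` at the given byte offsets, returning one slice per segment.
/// Hypothetical helper, not part of icu_segmenter.
/// Panics if an offset is out of range or not on a UTF-8 char boundary.
fn split_at_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2) // adjacent pairs: [start, end]
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}

fn main() {
    // Same string as above, with the non-ASCII letters written as escapes.
    let text = "Hello World. Xin ch\u{e0}o th\u{1ebf} gi\u{1edb}i!";
    // Offsets copied from the line break example above.
    let breakpoints = [0, 6, 13, 17, 23, 29, 36];
    for line in split_at_breakpoints(text, &breakpoints) {
        println!("{line:?}");
    }
}
```

Each slice ends at a break opportunity, so trailing spaces stay attached to the preceding segment.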

§Grapheme Cluster Break

Find all grapheme cluster boundaries:

 use icu::segmenter::GraphemeClusterSegmenter;

 let segmenter = GraphemeClusterSegmenter::new();

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(
     &breakpoints,
     &[
         0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
         19, 21, 22, 23, 24, 25, 28, 29, 30, 31, 34, 35, 36
     ]
 );

See GraphemeClusterSegmenter for more examples.
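The gaps in the list above (for example, 19 followed by 21) come from multi-byte UTF-8 characters, since offsets count bytes. As a rough illustration using only the standard library, the sketch below lists `char` boundaries with a hypothetical helper, `char_boundaries`. For this particular string, which uses precomposed letters, the result happens to match the grapheme cluster boundaries; that is not true in general, which is the reason to use a grapheme cluster segmenter.

```rust
/// Byte offsets of every `char` boundary in `text`, including 0 and `text.len()`.
/// Hypothetical helper, not part of icu_segmenter.
fn char_boundaries(text: &str) -> Vec<usize> {
    text.char_indices()
        .map(|(offset, _)| offset)
        .chain(std::iter::once(text.len()))
        .collect()
}

fn main() {
    // Same string as above, with the non-ASCII letters written as escapes.
    let text = "Hello World. Xin ch\u{e0}o th\u{1ebf} gi\u{1edb}i!";
    println!("{:?}", char_boundaries(text));

    // "e" followed by a combining acute accent is two chars (three bytes),
    // but a single grapheme cluster under UAX #29, so the char boundary
    // at offset 1 is not a grapheme cluster boundary.
    println!("{:?}", char_boundaries("e\u{301}"));
}
```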

§Word Break

Find all word boundaries:

 use icu::segmenter::{options::WordBreakInvariantOptions, WordSegmenter};

 let segmenter =
     WordSegmenter::new_auto(WordBreakInvariantOptions::default());

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(
     &breakpoints,
     &[0, 5, 6, 11, 12, 13, 16, 17, 22, 23, 28, 29, 35, 36]
 );

See WordSegmenter for more examples.
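Word boundaries delimit every segment, including the spaces and punctuation between words. A minimal sketch of keeping only the word-like segments follows. It is a standard-library heuristic (a segment counts as a word if it contains an alphanumeric character), not the crate's own classification, and `word_like_segments` is a hypothetical name.

```rust
/// Returns the segments between adjacent breakpoints that contain at least
/// one alphanumeric character. Hypothetical helper using a std-only
/// heuristic; it does not consult any ICU4X data.
fn word_like_segments<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .filter(|segment| segment.chars().any(char::is_alphanumeric))
        .collect()
}

fn main() {
    // Same string as above, with the non-ASCII letters written as escapes.
    let text = "Hello World. Xin ch\u{e0}o th\u{1ebf} gi\u{1edb}i!";
    // Offsets copied from the word break example above.
    let breakpoints = [0, 5, 6, 11, 12, 13, 16, 17, 22, 23, 28, 29, 35, 36];
    println!("{:?}", word_like_segments(text, &breakpoints));
}
```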

§Sentence Break

Segment the string into sentences:

 use icu::segmenter::{
     options::SentenceBreakInvariantOptions, SentenceSegmenter,
 };

 let segmenter =
     SentenceSegmenter::new(SentenceBreakInvariantOptions::default());

 let breakpoints: Vec<usize> = segmenter
     .segment_str("Hello World. Xin chào thế giới!")
     .collect();
 assert_eq!(&breakpoints, &[0, 13, 36]);

See SentenceSegmenter for more examples.
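The final boundary is 36 even though the string has 31 characters, because offsets are counted in UTF-8 bytes. If a caller needs character offsets instead (for example, for a UI that indexes by `char`), a conversion along these lines may help. This is a simple sketch with a hypothetical helper name, `to_char_offsets`; it rescans the prefix for every offset, so it is quadratic in the worst case.

```rust
/// Converts UTF-8 byte offsets into `char` offsets.
/// Hypothetical helper, not part of icu_segmenter.
/// Panics if an offset is out of range or not on a char boundary.
fn to_char_offsets(text: &str, byte_offsets: &[usize]) -> Vec<usize> {
    byte_offsets
        .iter()
        .map(|&offset| text[..offset].chars().count())
        .collect()
}

fn main() {
    // Same string as above, with the non-ASCII letters written as escapes.
    let text = "Hello World. Xin ch\u{e0}o th\u{1ebf} gi\u{1edb}i!";
    // Byte offsets copied from the sentence break example above.
    println!("{:?}", to_char_offsets(text, &[0, 13, 36]));
}
```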

Modules§

iterators
Types supporting iteration over segments. Obtained from the segmenter types.
options
Options structs and enums.
provider
🚧 [Unstable] Data provider struct definitions for this ICU4X component.
scaffold
Largely internal scaffolding types. You should very rarely need to reference these directly.

Structs§

GraphemeClusterSegmenter
Segments a string into grapheme clusters.
GraphemeClusterSegmenterBorrowed
Segments a string into grapheme clusters (borrowed version).
LineSegmenter
Supports loading line break data and creating line break iterators for different string encodings.
LineSegmenterBorrowed
Segments a string into lines (borrowed version).
SentenceSegmenter
Supports loading sentence break data and creating sentence break iterators for different string encodings.
SentenceSegmenterBorrowed
Segments a string into sentences (borrowed version).
WordSegmenter
Supports loading word break data and creating word break iterators for different string encodings.
WordSegmenterBorrowed
Segments a string into words (borrowed version).