Struct icu_segmenter::LineSegmenter

pub struct LineSegmenter { /* private fields */ }

Supports loading line break data, and creating line break iterators for different string encodings.

The segmenter returns mandatory breaks (as defined by definition LD7 of Unicode Standard Annex #14, Unicode Line Breaking Algorithm) as well as line break opportunities (definition LD3). It does not distinguish them. Callers requiring that distinction can check the Line_Break property of the code point preceding the break against those listed in rules LB4 and LB5, special-casing the end of text according to LB3.

For consistency with the grapheme, word, and sentence segmenters, there is always a breakpoint returned at index 0, but this breakpoint is not a meaningful line break opportunity.

use icu::segmenter::LineSegmenter;

let segmenter = LineSegmenter::new_auto();

let text = "Summary\r\nThis annex…";
let breakpoints: Vec<usize> = segmenter.segment_str(text).collect();
// 9 and 22 are mandatory breaks, 14 is a line break opportunity.
assert_eq!(&breakpoints, &[0, 9, 14, 22]);

// There is a break opportunity between emoji, but not within the ZWJ sequence 🏳️‍🌈.
let flag_equation = "🏳️➕🌈🟰🏳️\u{200D}🌈";
let possible_first_lines: Vec<&str> =
    segmenter.segment_str(flag_equation).skip(1).map(|i| &flag_equation[..i]).collect();
assert_eq!(
    &possible_first_lines,
    &[
        "🏳️",
        "🏳️➕",
        "🏳️➕🌈",
        "🏳️➕🌈🟰",
        "🏳️➕🌈🟰🏳️‍🌈"
    ]
);
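
Because the breakpoints are byte offsets into the input, the text between two consecutive offsets can be recovered by slicing. The following is a minimal sketch using only the standard library; `pieces_between` is a hypothetical helper name, not part of `icu::segmenter`, and it assumes the offsets were produced by `segment_str` on the same string.

```rust
/// Splits `text` at the given breakpoints and returns the piece between each
/// consecutive pair of offsets.
///
/// Assumption: `breakpoints` is sorted, starts at 0, ends at `text.len()`,
/// and every offset lies on a `char` boundary (as the offsets returned by
/// `segment_str` for the same `text` would).
fn pieces_between<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2) // consecutive pairs: [start, end]
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

With the text and breakpoints from the example above (`[0, 9, 14, 22]`), this would yield `"Summary\r\n"`, `"This "`, and `"annex…"`. The leading 0 only serves as the start of the first piece, which matches the note that it is not a meaningful break opportunity.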

§Examples

Segment a string with default options:

use icu::segmenter::LineSegmenter;

let segmenter = LineSegmenter::new_auto();

let breakpoints: Vec<usize> =
    segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 6, 11]);
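
One way to consume these breakpoints is a greedy line filler. The sketch below is an illustration under stated assumptions, not how ICU4X or any layout engine does it: `greedy_wrap` is a hypothetical helper, width is measured in `char`s rather than rendered width, every breakpoint is treated as a mere opportunity (mandatory breaks are not forced), and the last breakpoint is assumed to be `text.len()`.

```rust
/// Greedily packs `text` into lines of at most `max_chars` characters,
/// breaking only at the byte offsets listed in `breakpoints`.
///
/// Trailing whitespace is trimmed from each line. A piece that has no
/// breakpoint inside it and is wider than `max_chars` is kept as an
/// overlong line rather than being split.
fn greedy_wrap<'a>(text: &'a str, breakpoints: &[usize], max_chars: usize) -> Vec<&'a str> {
    let mut lines = Vec::new();
    let mut line_start = 0;
    // Furthest breakpoint seen so far on the current line.
    let mut last_fit = 0;

    for &bp in breakpoints.iter().filter(|&&bp| bp > 0) {
        let width = text[line_start..bp].trim_end().chars().count();
        // If extending the line to `bp` overflows, end the line at the
        // previous breakpoint (when there is one on this line).
        if width > max_chars && last_fit > line_start {
            lines.push(text[line_start..last_fit].trim_end());
            line_start = last_fit;
        }
        last_fit = bp;
    }

    // Whatever remains forms the final line.
    if line_start < text.len() {
        lines.push(text[line_start..].trim_end());
    }
    lines
}
```

For `"Hello World"` with the breakpoints `[0, 6, 11]` shown above, a limit of 8 characters would put `"Hello"` and `"World"` on separate lines, while a limit of 20 keeps the text on one line.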

Segment a string with CSS option overrides:

use icu::segmenter::{
    LineBreakOptions, LineBreakStrictness, LineBreakWordOption,
    LineSegmenter,
};

let mut options = LineBreakOptions::default();
options.strictness = LineBreakStrictness::Strict;
options.word_option = LineBreakWordOption::BreakAll;
options.content_locale = None;
let segmenter = LineSegmenter::new_auto_with_options(options);

let breakpoints: Vec<usize> =
    segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11]);

Segment a Latin1 byte string:

use icu::segmenter::LineSegmenter;

let segmenter = LineSegmenter::new_auto();

let breakpoints: Vec<usize> =
    segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 6, 11]);

Separate mandatory breaks from the break opportunities:

use icu::properties::{props::LineBreak, CodePointMapData};
use icu::segmenter::LineSegmenter;

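let segmenter = LineSegmenter::new_auto();
let text = "Summary\r\nThis annex…";

// Keep only the breakpoints that follow a hard line terminator or end the text.
let mandatory_breaks: Vec<usize> = segmenter
    .segment_str(text)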
    .filter(|&i| {
        text[..i].chars().next_back().map_or(false, |c| {
            matches!(
                CodePointMapData::<LineBreak>::new().get(c),
                LineBreak::MandatoryBreak
                    | LineBreak::CarriageReturn
                    | LineBreak::LineFeed
                    | LineBreak::NextLine
            ) || i == text.len()
        })
    })
    .collect();
assert_eq!(&mandatory_breaks, &[9, 22]);
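
When pulling in `icu::properties` is not desirable, the same classification can be approximated with a hard-coded character list. This is a sketch under assumptions: `forces_break` and `split_breaks` are hypothetical helpers, and the list of characters is written out by hand from the BK, CR, LF, and NL classes of UAX #14, so it would not follow future changes to the Line_Break property.

```rust
/// Returns true if `c` belongs to one of the Line_Break classes that force a
/// break after it (BK, CR, LF, NL). Hard-coded; an assumption of this sketch.
fn forces_break(c: char) -> bool {
    matches!(
        c,
        '\u{000A}' // LF: LINE FEED
            | '\u{000B}' // BK: LINE TABULATION
            | '\u{000C}' // BK: FORM FEED
            | '\u{000D}' // CR: CARRIAGE RETURN
            | '\u{0085}' // NL: NEXT LINE
            | '\u{2028}' // BK: LINE SEPARATOR
            | '\u{2029}' // BK: PARAGRAPH SEPARATOR
    )
}

/// Splits breakpoints into (mandatory breaks, break opportunities).
/// The breakpoint at index 0 is dropped; the end of text counts as mandatory.
fn split_breaks(text: &str, breakpoints: &[usize]) -> (Vec<usize>, Vec<usize>) {
    breakpoints
        .iter()
        .copied()
        .filter(|&i| i != 0)
        .partition(|&i| {
            i == text.len() || text[..i].chars().next_back().map_or(false, forces_break)
        })
}
```

Applied to the text above with breakpoints `[0, 9, 14, 22]`, the first list would hold 9 and 22 and the second would hold 14.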

Implementations§

impl LineSegmenter

pub fn new_auto() -> Self

Constructs a LineSegmenter with an invariant locale and the best available compiled data for complex scripts (Khmer, Lao, Myanmar, and Thai).

The current behavior, which is subject to change, is to use the LSTM model when available.

See also Self::new_auto_with_options.

✨ Enabled with the compiled_data and auto Cargo features.

📚 Help choosing a constructor

pub fn try_new_auto_with_any_provider(provider: &(impl AnyProvider + ?Sized)) -> Result<Self, DataError>

A version of Self::new_auto that uses custom data provided by an AnyProvider.

📚 Help choosing a constructor

pub fn try_new_auto_with_buffer_provider(provider: &(impl BufferProvider + ?Sized)) -> Result<Self, DataError>

A version of Self::new_auto that uses custom data provided by a BufferProvider.

✨ Enabled with the serde feature.

📚 Help choosing a constructor

pub fn try_new_auto_unstable<D>(provider: &D) -> Result<Self, DataError>
where D: DataProvider<LineBreakDataV2Marker> + DataProvider<LstmForWordLineAutoV1Marker> + DataProvider<GraphemeClusterBreakDataV2Marker> + ?Sized,

A version of Self::new_auto that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn new_lstm() -> Self

Constructs a LineSegmenter with an invariant locale and compiled LSTM data for complex scripts (Khmer, Lao, Myanmar, and Thai).

The LSTM, or Long Short-Term Memory, is a machine learning model. It is smaller than the full dictionary but more expensive during segmentation (inference).

See also Self::new_lstm_with_options.

✨ Enabled with the compiled_data and lstm Cargo features.

📚 Help choosing a constructor

pub fn try_new_lstm_with_any_provider(provider: &(impl AnyProvider + ?Sized)) -> Result<Self, DataError>

A version of Self::new_lstm that uses custom data provided by an AnyProvider.

📚 Help choosing a constructor

pub fn try_new_lstm_with_buffer_provider(provider: &(impl BufferProvider + ?Sized)) -> Result<Self, DataError>

A version of Self::new_lstm that uses custom data provided by a BufferProvider.

✨ Enabled with the serde feature.

📚 Help choosing a constructor

pub fn try_new_lstm_unstable<D>(provider: &D) -> Result<Self, DataError>
where D: DataProvider<LineBreakDataV2Marker> + DataProvider<LstmForWordLineAutoV1Marker> + DataProvider<GraphemeClusterBreakDataV2Marker> + ?Sized,

A version of Self::new_lstm that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn new_dictionary() -> Self

Constructs a LineSegmenter with an invariant locale and compiled dictionary data for complex scripts (Khmer, Lao, Myanmar, and Thai).

The dictionary model uses a list of words to determine appropriate breakpoints. It is faster than the LSTM model but requires more data.

See also Self::new_dictionary_with_options.

✨ Enabled with the compiled_data Cargo feature.

📚 Help choosing a constructor

pub fn try_new_dictionary_with_any_provider(provider: &(impl AnyProvider + ?Sized)) -> Result<Self, DataError>

A version of Self::new_dictionary that uses custom data provided by an AnyProvider.

📚 Help choosing a constructor

pub fn try_new_dictionary_with_buffer_provider(provider: &(impl BufferProvider + ?Sized)) -> Result<Self, DataError>

A version of Self::new_dictionary that uses custom data provided by a BufferProvider.

✨ Enabled with the serde feature.

📚 Help choosing a constructor

pub fn try_new_dictionary_unstable<D>(provider: &D) -> Result<Self, DataError>

A version of Self::new_dictionary that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn new_auto_with_options(options: LineBreakOptions<'_>) -> Self

Constructs a LineSegmenter with an invariant locale, custom LineBreakOptions, and the best available compiled data for complex scripts (Khmer, Lao, Myanmar, and Thai).

The current behavior, which is subject to change, is to use the LSTM model when available.

See also Self::new_auto.

✨ Enabled with the compiled_data and auto Cargo features.

📚 Help choosing a constructor

pub fn try_new_auto_with_options_with_any_provider(provider: &(impl AnyProvider + ?Sized), options: LineBreakOptions<'_>) -> Result<Self, DataError>

A version of Self::new_auto_with_options that uses custom data provided by an AnyProvider.

📚 Help choosing a constructor

pub fn try_new_auto_with_options_with_buffer_provider(provider: &(impl BufferProvider + ?Sized), options: LineBreakOptions<'_>) -> Result<Self, DataError>

A version of Self::new_auto_with_options that uses custom data provided by a BufferProvider.

✨ Enabled with the serde feature.

📚 Help choosing a constructor

pub fn try_new_auto_with_options_unstable<D>(provider: &D, options: LineBreakOptions<'_>) -> Result<Self, DataError>
where D: DataProvider<LineBreakDataV2Marker> + DataProvider<LstmForWordLineAutoV1Marker> + DataProvider<GraphemeClusterBreakDataV2Marker> + ?Sized,

A version of Self::new_auto_with_options that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn new_lstm_with_options(options: LineBreakOptions<'_>) -> Self

Constructs a LineSegmenter with an invariant locale, custom LineBreakOptions, and compiled LSTM data for complex scripts (Khmer, Lao, Myanmar, and Thai).

The LSTM, or Long Short-Term Memory, is a machine learning model. It is smaller than the full dictionary but more expensive during segmentation (inference).

See also Self::new_lstm.

✨ Enabled with the compiled_data and lstm Cargo features.

📚 Help choosing a constructor

pub fn try_new_lstm_with_options_with_any_provider(provider: &(impl AnyProvider + ?Sized), options: LineBreakOptions<'_>) -> Result<Self, DataError>

A version of Self::new_lstm_with_options that uses custom data provided by an AnyProvider.

📚 Help choosing a constructor

pub fn try_new_lstm_with_options_with_buffer_provider(provider: &(impl BufferProvider + ?Sized), options: LineBreakOptions<'_>) -> Result<Self, DataError>

A version of Self::new_lstm_with_options that uses custom data provided by a BufferProvider.

✨ Enabled with the serde feature.

📚 Help choosing a constructor

pub fn try_new_lstm_with_options_unstable<D>(provider: &D, options: LineBreakOptions<'_>) -> Result<Self, DataError>
where D: DataProvider<LineBreakDataV2Marker> + DataProvider<LstmForWordLineAutoV1Marker> + DataProvider<GraphemeClusterBreakDataV2Marker> + ?Sized,

A version of Self::new_lstm_with_options that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn new_dictionary_with_options(options: LineBreakOptions<'_>) -> Self

Constructs a LineSegmenter with an invariant locale, custom LineBreakOptions, and compiled dictionary data for complex scripts (Khmer, Lao, Myanmar, and Thai).

The dictionary model uses a list of words to determine appropriate breakpoints. It is faster than the LSTM model but requires more data.

See also Self::new_dictionary.

✨ Enabled with the compiled_data Cargo feature.

📚 Help choosing a constructor

pub fn try_new_dictionary_with_options_with_any_provider(provider: &(impl AnyProvider + ?Sized), options: LineBreakOptions<'_>) -> Result<Self, DataError>

A version of Self::new_dictionary_with_options that uses custom data provided by an AnyProvider.

📚 Help choosing a constructor

pub fn try_new_dictionary_with_options_with_buffer_provider(provider: &(impl BufferProvider + ?Sized), options: LineBreakOptions<'_>) -> Result<Self, DataError>

A version of Self::new_dictionary_with_options that uses custom data provided by a BufferProvider.

✨ Enabled with the serde feature.

📚 Help choosing a constructor

pub fn try_new_dictionary_with_options_unstable<D>(provider: &D, options: LineBreakOptions<'_>) -> Result<Self, DataError>

A version of Self::new_dictionary_with_options that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn segment_str<'l, 's>(&'l self, input: &'s str) -> LineBreakIteratorUtf8<'l, 's>

Creates a line break iterator for an str (a UTF-8 string).

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.

pub fn segment_utf8<'l, 's>(&'l self, input: &'s [u8]) -> LineBreakIteratorPotentiallyIllFormedUtf8<'l, 's>

Creates a line break iterator for a potentially ill-formed UTF-8 string.

Invalid characters are treated as U+FFFD REPLACEMENT CHARACTER.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
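
To decide between `segment_str` and `segment_utf8`, or to preview how ill-formed input looks once replacement characters are substituted, the standard library is enough. The sketch below is an analogy only: `check_utf8` and `lossy_preview` are hypothetical helpers, and the standard library's lossy decoding is not guaranteed to group invalid bytes exactly as the segmenter does. The breakpoints returned for a byte slice presumably index into the original bytes, not into the lossy string, which can be longer.

```rust
/// Returns the input as `&str` when it is well-formed UTF-8, otherwise the
/// byte offset at which the first invalid sequence starts.
fn check_utf8(bytes: &[u8]) -> Result<&str, usize> {
    std::str::from_utf8(bytes).map_err(|e| e.valid_up_to())
}

/// Decodes `bytes`, substituting U+FFFD REPLACEMENT CHARACTER for each
/// invalid sequence (standard library behaviour, shown for illustration).
fn lossy_preview(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

A caller could pass the `Ok` value straight to `segment_str` and fall back to `segment_utf8` on `Err`.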


pub fn segment_latin1<'l, 's>(&'l self, input: &'s [u8]) -> LineBreakIteratorLatin1<'l, 's>

Creates a line break iterator for a Latin-1 (8-bit) string.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
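
In Latin-1 every byte is the code point with the same value, so one byte is one character, and an offset into the byte slice stops being a valid `str` offset once the text is converted to a Rust `String`. The following sketch shows the conversion with the standard library only; both function names are hypothetical, and it assumes the breakpoints index into the input byte slice.

```rust
/// Decodes a Latin-1 byte string: each byte becomes the `char` with the
/// same code point value (U+0000 through U+00FF).
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Maps an offset into the Latin-1 bytes to the matching byte offset in the
/// UTF-8 `String` produced by `latin1_to_string`.
///
/// Panics if `offset` is greater than `bytes.len()`.
fn latin1_offset_to_utf8(bytes: &[u8], offset: usize) -> usize {
    bytes[..offset]
        .iter()
        // Bytes below 0x80 stay one byte in UTF-8; the rest take two.
        .map(|&b| if b < 0x80 { 1usize } else { 2usize })
        .sum()
}
```

For pure ASCII input such as `b"Hello World"` the two offsets coincide, so the conversion only matters when bytes in the range 0x80 to 0xFF are present.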


pub fn segment_utf16<'l, 's>(&'l self, input: &'s [u16]) -> LineBreakIteratorUtf16<'l, 's>

Creates a line break iterator for a UTF-16 string.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
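
A `&[u16]` input can be built from a `&str` with `text.encode_utf16().collect::<Vec<u16>>()`. Breakpoints for such input are presumably expressed in UTF-16 code units, so they need converting before they can index the original `str`. The helper below is a hypothetical sketch of that conversion using only the standard library.

```rust
/// Converts an offset measured in UTF-16 code units into the matching byte
/// offset in `text`.
///
/// Returns `None` if the offset falls inside a surrogate pair or lies past
/// the end of the text.
fn utf16_offset_to_byte_offset(text: &str, utf16_offset: usize) -> Option<usize> {
    // Number of UTF-16 code units that precede the current character.
    let mut units = 0;
    for (byte_idx, ch) in text.char_indices() {
        if units == utf16_offset {
            return Some(byte_idx);
        }
        if units > utf16_offset {
            // The requested offset was skipped: it splits a surrogate pair.
            return None;
        }
        units += ch.len_utf16();
    }
    // The only remaining valid offset is the end of the text.
    (units == utf16_offset).then_some(text.len())
}
```

For example, in `"a🌈b"` the rainbow occupies two UTF-16 code units but four UTF-8 bytes, so UTF-16 offset 3 corresponds to byte offset 5.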

Trait Implementations§

impl Debug for LineSegmenter

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> ErasedDestructor for T
where T: 'static,

impl<T> MaybeSendSync for T
where T: Send + Sync,