Struct icu::segmenter::SentenceSegmenter

pub struct SentenceSegmenter { /* private fields */ }

Supports loading sentence break data, and creating sentence break iterators for different string encodings.

Examples

Segment a string:

use icu::segmenter::SentenceSegmenter;
let segmenter = SentenceSegmenter::new();

let breakpoints: Vec<usize> =
    segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);

Segment a Latin1 byte string:

use icu::segmenter::SentenceSegmenter;
let segmenter = SentenceSegmenter::new();

let breakpoints: Vec<usize> =
    segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 11]);

Successive boundaries can be used to retrieve the sentences. In particular, the first boundary is always 0, and the last one is the length of the segmented text in code units.

use icu::segmenter::SentenceSegmenter;
use itertools::Itertools;

let segmenter = SentenceSegmenter::new();
let text = "Ceci tuera cela. Le livre tuera l’édifice.";
let sentences: Vec<&str> = segmenter
    .segment_str(text)
    .tuple_windows()
    .map(|(i, j)| &text[i..j])
    .collect();
assert_eq!(
    &sentences,
    &["Ceci tuera cela. ", "Le livre tuera l’édifice."]
);
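
If depending on itertools is undesirable, the same pairing of successive boundaries can be done with the standard library's slice windows. The following is a minimal sketch, not part of the icu API: the helper name split_at_boundaries is hypothetical, and it assumes the boundaries were already collected into a slice of byte offsets, for example from segment_str.

```rust
/// Hypothetical helper (not part of `icu`): slices `text` at each pair of
/// successive boundaries. `boundaries` is assumed to hold sorted byte offsets
/// that fall on char boundaries, such as those collected from `segment_str`.
fn split_at_boundaries<'a>(text: &'a str, boundaries: &[usize]) -> Vec<&'a str> {
    boundaries
        .windows(2) // overlapping pairs: [b0, b1], [b1, b2], ...
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

With assumed boundaries [0, 13, 17] for "Hello World. Bye.", this yields "Hello World. " and "Bye."; a lone boundary at 0 (the empty string case) yields no sentences.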

Implementations

impl SentenceSegmenter

pub fn new() -> SentenceSegmenter

Constructs a SentenceSegmenter with an invariant locale and compiled data.

Enabled with the compiled_data Cargo feature.

📚 Help choosing a constructor

pub fn try_new_with_any_provider(provider: &(impl AnyProvider + ?Sized)) -> Result<SentenceSegmenter, DataError>

A version of Self::new that uses custom data provided by an AnyProvider.

📚 Help choosing a constructor

pub fn try_new_with_buffer_provider(provider: &(impl BufferProvider + ?Sized)) -> Result<SentenceSegmenter, DataError>

A version of Self::new that uses custom data provided by a BufferProvider.

Enabled with the serde feature.

📚 Help choosing a constructor

pub fn try_new_unstable<D>(provider: &D) -> Result<SentenceSegmenter, DataError>

A version of Self::new that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn try_new_with_options(options: SentenceBreakOptions<'_>) -> Result<SentenceSegmenter, DataError>

Constructs a SentenceSegmenter with the given options, using compiled data.

Enabled with the compiled_data Cargo feature.

📚 Help choosing a constructor

pub fn try_new_with_options_with_any_provider(provider: &(impl AnyProvider + ?Sized), options: SentenceBreakOptions<'_>) -> Result<SentenceSegmenter, DataError>

A version of Self::try_new_with_options that uses custom data provided by an AnyProvider.

📚 Help choosing a constructor

pub fn try_new_with_options_with_buffer_provider(provider: &(impl BufferProvider + ?Sized), options: SentenceBreakOptions<'_>) -> Result<SentenceSegmenter, DataError>

A version of Self::try_new_with_options that uses custom data provided by a BufferProvider.

Enabled with the serde feature.

📚 Help choosing a constructor

pub fn try_new_with_options_unstable<D>(provider: &D, options: SentenceBreakOptions<'_>) -> Result<SentenceSegmenter, DataError>

A version of Self::try_new_with_options that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn segment_str<'l, 's>(&'l self, input: &'s str) -> SentenceBreakIterator<'l, 's, RuleBreakTypeUtf8>

Creates a sentence break iterator for an str (a UTF-8 string).

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.

pub fn segment_utf8<'l, 's>(&'l self, input: &'s [u8]) -> SentenceBreakIterator<'l, 's, RuleBreakTypePotentiallyIllFormedUtf8>

Creates a sentence break iterator for a potentially ill-formed UTF-8 string.

Invalid byte sequences are treated as U+FFFD REPLACEMENT CHARACTER.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
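
To turn segments of such input into displayable text, each byte range can be decoded lossily with the standard library, which also substitutes U+FFFD for invalid sequences. This is a sketch of std behaviour only, not of how the segmenter works internally. The helper name decode_segments_lossy is hypothetical, and it assumes the breakpoints are byte offsets into the original input slice, which is what the documented final breakpoint (the input length) suggests.

```rust
/// Hypothetical helper (not part of `icu`): decodes each segment of possibly
/// ill-formed UTF-8, replacing invalid sequences with U+FFFD.
/// `boundaries` is assumed to hold sorted byte offsets into `input`.
fn decode_segments_lossy(input: &[u8], boundaries: &[usize]) -> Vec<String> {
    boundaries
        .windows(2) // successive boundary pairs delimit one segment each
        .map(|pair| String::from_utf8_lossy(&input[pair[0]..pair[1]]).into_owned())
        .collect()
}
```

For the 15-byte input b"Hi\xFF there. Bye." and assumed boundaries [0, 11, 15], the first decoded segment is "Hi\u{FFFD} there. ". Note that it is two bytes longer than the original range, so offsets into the raw input should not be reused on the decoded strings.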

pub fn segment_latin1<'l, 's>(&'l self, input: &'s [u8]) -> SentenceBreakIterator<'l, 's, RuleBreakTypeLatin1>

Creates a sentence break iterator for a Latin-1 (8-bit) string.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
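
In Latin-1 every byte is one character whose code point equals the byte value, so byte offsets and character counts coincide. The sketch below, using only the standard library, converts Latin-1 bytes to a String; the helper name latin1_to_string is hypothetical and not part of the icu API.

```rust
/// Hypothetical helper (not part of `icu`): converts Latin-1 bytes to a
/// `String`. Each byte maps to the Unicode code point of the same value
/// (U+0000 to U+00FF).
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}
```

For example, the 5-byte input b"caf\xE9." becomes "café.", which occupies 6 bytes as UTF-8. Offsets into the Latin-1 input are therefore not valid byte offsets into the converted String once non-ASCII characters are present.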

pub fn segment_utf16<'l, 's>(&'l self, input: &'s [u16]) -> SentenceBreakIterator<'l, 's, RuleBreakTypeUtf16>

Creates a sentence break iterator for a UTF-16 string.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
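
Because the string length here is counted in 16-bit code units, the final breakpoint for a given text generally differs from the one reported for the same text as a str. The following standard-library sketch compares the two lengths; the helper name code_unit_lengths is hypothetical. A &[u16] input for this method can be built the same way, by collecting encode_utf16() into a Vec<u16>.

```rust
/// Hypothetical helper (not part of `icu`): returns the length of `text`
/// in UTF-8 bytes and in UTF-16 code units.
fn code_unit_lengths(text: &str) -> (usize, usize) {
    // Re-encode the text as UTF-16; characters outside the BMP take two units.
    let utf16: Vec<u16> = text.encode_utf16().collect();
    (text.len(), utf16.len())
}
```

For "l’édifice." this returns (13, 10): the apostrophe and the accented letter take several bytes in UTF-8 but a single code unit each in UTF-16.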

Trait Implementations

impl Debug for SentenceSegmenter

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

Formats the value using the given formatter.

impl Default for SentenceSegmenter

fn default() -> SentenceSegmenter

Returns the “default value” for a type.

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> ErasedDestructor for T
where T: 'static,

impl<T> MaybeSendSync for T
where T: Send + Sync,