Struct icu::segmenter::WordSegmenter
pub struct WordSegmenter { /* private fields */ }
Supports loading word break data, and creating word break iterators for different string encodings.
§Examples
Segment a string:
use icu::segmenter::WordSegmenter;
let segmenter = WordSegmenter::new_auto();
let breakpoints: Vec<usize> =
segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);
Segment a Latin1 byte string:
use icu::segmenter::WordSegmenter;
let segmenter = WordSegmenter::new_auto();
let breakpoints: Vec<usize> =
segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 5, 6, 11]);
Successive boundaries can be used to retrieve the segments. In particular, the first boundary is always 0, and the last one is the length of the segmented text in code units.
use icu::segmenter::WordSegmenter;
use itertools::Itertools;
let segmenter = WordSegmenter::new_auto();
let text = "Mark’d ye his words?";
let segments: Vec<&str> = segmenter
.segment_str(text)
.tuple_windows()
.map(|(i, j)| &text[i..j])
.collect();
assert_eq!(
&segments,
&["Mark’d", " ", "ye", " ", "his", " ", "words", "?"]
);
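The same pairing of successive boundaries can also be done with the standard library alone. The following is a minimal sketch, assuming the breakpoints have already been collected into a slice; the helper name `segments_from_breakpoints` is hypothetical, and the breakpoints are hardcoded from the "Hello World" example above rather than computed by a segmenter.

```rust
/// Hypothetical helper: returns the segments between successive boundaries.
/// Assumes `breakpoints` is sorted and every offset lies on a UTF-8 char
/// boundary of `text`; slicing panics otherwise.
fn segments_from_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2) // each window is one pair of successive boundaries [i, j]
        .map(|w| &text[w[0]..w[1]])
        .collect()
}

fn main() {
    let text = "Hello World";
    // Boundaries copied from the `segment_str("Hello World")` example above.
    let breakpoints = vec![0, 5, 6, 11];
    let segments = segments_from_breakpoints(text, &breakpoints);
    assert_eq!(segments, ["Hello", " ", "World"]);
}
```

Using `windows(2)` avoids the `itertools` dependency at the cost of collecting the boundaries first.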
Not all segments delimited by word boundaries are words; some are interword segments such as spaces and punctuation.
The WordBreakIterator::word_type() of a boundary can be used to classify the preceding segment; WordBreakIterator::iter_with_word_type() associates each boundary with its status.
let words: Vec<&str> = segmenter
.segment_str(text)
.iter_with_word_type()
.tuple_windows()
.filter(|(_, (_, segment_type))| segment_type.is_word_like())
.map(|((i, _), (j, _))| &text[i..j])
.collect();
assert_eq!(&words, &["Mark’d", "ye", "his", "words"]);
Implementations§
impl WordSegmenter
pub fn new_auto() -> WordSegmenter
Constructs a WordSegmenter with an invariant locale and the best available compiled data for complex scripts (Chinese, Japanese, Khmer, Lao, Myanmar, and Thai).
The current behavior, which is subject to change, is to use the LSTM model when available and the dictionary model for Chinese and Japanese.
✨ Enabled with the compiled_data and auto Cargo features.
§Examples
Behavior with complex scripts:
use icu::segmenter::WordSegmenter;
let th_str = "ทุกสองสัปดาห์";
let ja_str = "こんにちは世界";
let segmenter = WordSegmenter::new_auto();
let th_bps = segmenter.segment_str(th_str).collect::<Vec<_>>();
let ja_bps = segmenter.segment_str(ja_str).collect::<Vec<_>>();
assert_eq!(th_bps, [0, 9, 18, 39]);
assert_eq!(ja_bps, [0, 15, 21]);
pub fn try_new_auto_with_any_provider(provider: &(impl AnyProvider + ?Sized)) -> Result<WordSegmenter, DataError>
A version of Self::new_auto that uses custom data provided by an AnyProvider.
pub fn try_new_auto_with_buffer_provider(provider: &(impl BufferProvider + ?Sized)) -> Result<WordSegmenter, DataError>
A version of Self::new_auto that uses custom data provided by a BufferProvider.
✨ Enabled with the serde feature.
pub fn try_new_auto_unstable<D>(provider: &D) -> Result<WordSegmenter, DataError>
A version of Self::new_auto that uses custom data provided by a DataProvider.
pub fn try_new_auto_with_options(options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
✨ Enabled with the compiled_data Cargo feature.
pub fn try_new_auto_with_options_with_any_provider(provider: &(impl AnyProvider + ?Sized), options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
A version of Self::try_new_auto_with_options that uses custom data provided by an AnyProvider.
pub fn try_new_auto_with_options_with_buffer_provider(provider: &(impl BufferProvider + ?Sized), options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
A version of Self::try_new_auto_with_options that uses custom data provided by a BufferProvider.
✨ Enabled with the serde feature.
pub fn try_new_auto_with_options_unstable<D>(provider: &D, options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
A version of Self::try_new_auto_with_options that uses custom data provided by a DataProvider.
pub fn new_lstm() -> WordSegmenter
Constructs a WordSegmenter with an invariant locale and compiled LSTM data for complex scripts (Burmese, Khmer, Lao, and Thai).
The LSTM, or Long Short-Term Memory, is a machine learning model. It is smaller than the full dictionary but more expensive during segmentation (inference).
Warning: there is currently no LSTM model for Chinese or Japanese, so the WordSegmenter created by this function will have unexpected behavior in spans of those scripts.
✨ Enabled with the compiled_data and lstm Cargo features.
§Examples
Behavior with complex scripts:
use icu::segmenter::WordSegmenter;
let th_str = "ทุกสองสัปดาห์";
let ja_str = "こんにちは世界";
let segmenter = WordSegmenter::new_lstm();
let th_bps = segmenter.segment_str(th_str).collect::<Vec<_>>();
let ja_bps = segmenter.segment_str(ja_str).collect::<Vec<_>>();
assert_eq!(th_bps, [0, 9, 18, 39]);
// Note: We aren't able to find a suitable breakpoint in Chinese/Japanese.
assert_eq!(ja_bps, [0, 21]);
pub fn try_new_lstm_with_any_provider(provider: &(impl AnyProvider + ?Sized)) -> Result<WordSegmenter, DataError>
A version of Self::new_lstm that uses custom data provided by an AnyProvider.
pub fn try_new_lstm_with_buffer_provider(provider: &(impl BufferProvider + ?Sized)) -> Result<WordSegmenter, DataError>
A version of Self::new_lstm that uses custom data provided by a BufferProvider.
✨ Enabled with the serde feature.
pub fn try_new_lstm_unstable<D>(provider: &D) -> Result<WordSegmenter, DataError>
A version of Self::new_lstm that uses custom data provided by a DataProvider.
pub fn try_new_lstm_with_options(options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
✨ Enabled with the compiled_data Cargo feature.
pub fn try_new_lstm_with_options_with_any_provider(provider: &(impl AnyProvider + ?Sized), options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
A version of Self::try_new_lstm_with_options that uses custom data provided by an AnyProvider.
pub fn try_new_lstm_with_options_with_buffer_provider(provider: &(impl BufferProvider + ?Sized), options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
A version of Self::try_new_lstm_with_options that uses custom data provided by a BufferProvider.
✨ Enabled with the serde feature.
pub fn try_new_lstm_with_options_unstable<D>(provider: &D, options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
A version of Self::try_new_lstm_with_options that uses custom data provided by a DataProvider.
pub fn new_dictionary() -> WordSegmenter
Constructs a WordSegmenter with an invariant locale and compiled dictionary data for complex scripts (Chinese, Japanese, Khmer, Lao, Myanmar, and Thai).
The dictionary model uses a list of words to determine appropriate breakpoints. It is faster than the LSTM model but requires more data.
✨ Enabled with the compiled_data Cargo feature.
§Examples
Behavior with complex scripts:
use icu::segmenter::WordSegmenter;
let th_str = "ทุกสองสัปดาห์";
let ja_str = "こんにちは世界";
let segmenter = WordSegmenter::new_dictionary();
let th_bps = segmenter.segment_str(th_str).collect::<Vec<_>>();
let ja_bps = segmenter.segment_str(ja_str).collect::<Vec<_>>();
assert_eq!(th_bps, [0, 9, 18, 39]);
assert_eq!(ja_bps, [0, 15, 21]);
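To give an intuition for how a word list can determine breakpoints, here is a toy greedy longest-match sketch that uses only the standard library. It is not the algorithm ICU4X implements; the function name `toy_dictionary_breaks` and the two-entry word list are hypothetical and chosen only so that the result matches the Japanese example above.

```rust
use std::collections::HashSet;

/// Toy illustration only: greedy longest-match segmentation against a word list.
/// Returns byte offsets of the boundaries, starting with 0.
fn toy_dictionary_breaks(text: &str, dictionary: &HashSet<&str>) -> Vec<usize> {
    // Byte offsets of every char boundary, including the end of the text.
    let offsets: Vec<usize> = text
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(text.len()))
        .collect();
    let mut breaks = vec![0];
    let mut start = 0; // index into `offsets`, not a byte offset
    while start + 1 < offsets.len() {
        // Try the longest candidate first; fall back to a single character.
        let mut end = offsets.len() - 1;
        while end > start + 1 && !dictionary.contains(&text[offsets[start]..offsets[end]]) {
            end -= 1;
        }
        breaks.push(offsets[end]);
        start = end;
    }
    breaks
}

fn main() {
    // Hypothetical word list, not real ICU4X dictionary data.
    let dictionary: HashSet<&str> = ["こんにちは", "世界"].iter().copied().collect();
    let breaks = toy_dictionary_breaks("こんにちは世界", &dictionary);
    assert_eq!(breaks, [0, 15, 21]);
}
```

The sketch shows why a dictionary approach needs more data than a model: every word it should recognize has to be present in the list.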
pub fn try_new_dictionary_with_any_provider(provider: &(impl AnyProvider + ?Sized)) -> Result<WordSegmenter, DataError>
A version of Self::new_dictionary that uses custom data provided by an AnyProvider.
pub fn try_new_dictionary_with_buffer_provider(provider: &(impl BufferProvider + ?Sized)) -> Result<WordSegmenter, DataError>
A version of Self::new_dictionary that uses custom data provided by a BufferProvider.
✨ Enabled with the serde feature.
pub fn try_new_dictionary_unstable<D>(provider: &D) -> Result<WordSegmenter, DataError>
A version of Self::new_dictionary that uses custom data provided by a DataProvider.
pub fn try_new_dictionary_with_options(options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
✨ Enabled with the compiled_data Cargo feature.
pub fn try_new_dictionary_with_options_with_any_provider(provider: &(impl AnyProvider + ?Sized), options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
A version of Self::try_new_dictionary_with_options that uses custom data provided by an AnyProvider.
pub fn try_new_dictionary_with_options_with_buffer_provider(provider: &(impl BufferProvider + ?Sized), options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
A version of Self::try_new_dictionary_with_options that uses custom data provided by a BufferProvider.
✨ Enabled with the serde feature.
pub fn try_new_dictionary_with_options_unstable<D>(provider: &D, options: WordBreakOptions<'_>) -> Result<WordSegmenter, DataError>
A version of Self::try_new_dictionary_with_options that uses custom data provided by a DataProvider.
pub fn segment_str<'l, 's>(&'l self, input: &'s str) -> WordBreakIterator<'l, 's, WordBreakTypeUtf8>
Creates a word break iterator for a str (a UTF-8 string).
There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
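The guarantee above can be expressed as a small check on a collected boundary list. This is a sketch using only the standard library; the helper name `is_well_formed_boundary_list` is hypothetical, and the requirements that boundaries are strictly increasing and fall on char boundaries are assumptions made for the illustration rather than statements from this page.

```rust
/// Hypothetical checker: first boundary is 0, last is the byte length,
/// and (by assumption) boundaries are strictly increasing and sliceable.
fn is_well_formed_boundary_list(text: &str, breakpoints: &[usize]) -> bool {
    let starts_at_zero = breakpoints.first() == Some(&0);
    let ends_at_len = breakpoints.last() == Some(&text.len());
    let strictly_increasing = breakpoints.windows(2).all(|w| w[0] < w[1]);
    // `is_char_boundary` returns false for offsets past the end of `text`.
    let on_char_boundaries = breakpoints.iter().all(|&b| text.is_char_boundary(b));
    starts_at_zero && ends_at_len && strictly_increasing && on_char_boundaries
}

fn main() {
    // Boundaries copied from the "Hello World" example on this page.
    assert!(is_well_formed_boundary_list("Hello World", &[0, 5, 6, 11]));
    // The empty string has a single boundary at 0.
    assert!(is_well_formed_boundary_list("", &[0]));
    // A list that stops before the end of the text is rejected.
    assert!(!is_well_formed_boundary_list("Hello World", &[0, 5]));
}
```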
pub fn segment_utf8<'l, 's>(&'l self, input: &'s [u8]) -> WordBreakIterator<'l, 's, WordBreakTypePotentiallyIllFormedUtf8>
Creates a word break iterator for a potentially ill-formed UTF-8 string.
Invalid characters are treated as REPLACEMENT CHARACTER (U+FFFD).
There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
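As an analogy for the replacement behavior, the standard library's lossy decoder also maps ill-formed sequences to U+FFFD. The sketch below only demonstrates `String::from_utf8_lossy`; it makes no claim about how segment_utf8 decodes internally or how it reports offsets, and the helper name `decode_lossy` is hypothetical.

```rust
/// Hypothetical helper: decodes bytes, replacing ill-formed sequences with U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // The byte 0xFF can never appear in well-formed UTF-8.
    let bytes: &[u8] = b"Hello\xFFWorld";
    assert!(std::str::from_utf8(bytes).is_err());
    assert_eq!(decode_lossy(bytes), "Hello\u{FFFD}World");
}
```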
pub fn segment_latin1<'l, 's>(&'l self, input: &'s [u8]) -> WordBreakIterator<'l, 's, RuleBreakTypeLatin1>
Creates a word break iterator for a Latin-1 (8-bit) string.
There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
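In Latin-1 every byte is one character, so offsets into a Latin-1 byte string generally differ from offsets into the same text encoded as UTF-8. The following standard-library sketch illustrates the difference; the helper name `latin1_to_string` is hypothetical and is not part of icu::segmenter.

```rust
/// Hypothetical helper: decodes Latin-1, where each byte is the code point
/// with the same numeric value (U+0000..=U+00FF).
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    // 0xE9 is 'é' in Latin-1.
    let latin1: &[u8] = b"caf\xE9 au lait";
    let decoded = latin1_to_string(latin1);
    assert_eq!(decoded, "café au lait");
    // One code unit per character in Latin-1, but 'é' takes two bytes in UTF-8.
    assert_eq!(latin1.len(), 12);
    assert_eq!(decoded.len(), 13);
}
```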
pub fn segment_utf16<'l, 's>(&'l self, input: &'s [u16]) -> WordBreakIterator<'l, 's, WordBreakTypeUtf16>
Creates a word break iterator for a UTF-16 string.
There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
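Because the final boundary equals the input length in code units, the same text ends at different offsets depending on its encoding. This standard-library sketch builds a `&[u16]`-compatible buffer from a str and compares lengths; the helper name `code_unit_lengths` is hypothetical, and no segmenter is invoked here.

```rust
/// Hypothetical helper: returns (UTF-8 code units, UTF-16 code units) for `text`.
fn code_unit_lengths(text: &str) -> (usize, usize) {
    (text.len(), text.encode_utf16().count())
}

fn main() {
    let ja_str = "こんにちは世界";
    // A `Vec<u16>` like this can be borrowed as the `&[u16]` that `segment_utf16` takes.
    let utf16: Vec<u16> = ja_str.encode_utf16().collect();
    assert_eq!(utf16.len(), 7);
    // 21 bytes in UTF-8 (the last boundary in the examples above), 7 units in UTF-16.
    assert_eq!(code_unit_lengths(ja_str), (21, 7));
    // A character outside the BMP takes two UTF-16 code units (a surrogate pair).
    assert_eq!(code_unit_lengths("🦀"), (4, 2));
}
```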
Trait Implementations§
Auto Trait Implementations§
impl Freeze for WordSegmenter
impl RefUnwindSafe for WordSegmenter
impl Send for WordSegmenter
impl Sync for WordSegmenter
impl Unpin for WordSegmenter
impl UnwindSafe for WordSegmenter
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.