public abstract class BreakIterator extends Object implements Cloneable
[icu enhancement] ICU's replacement for java.text.BreakIterator. Methods, fields, and other functionality specific to ICU are labeled '[icu]'.
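A class that locates boundaries in text. This class defines a protocol for objects that break up a piece of natural-language text according to a set of criteria. Instances or subclasses of BreakIterator can be provided, for example, to break a piece of text into words, sentences, or logical characters according to the conventions of some language or group of languages. We provide five built-in types of BreakIterator, each obtained from a static factory method:
- getCharacterInstance() returns a BreakIterator that locates logical-character boundaries.
- getWordInstance() returns a BreakIterator that locates word boundaries.
- getLineInstance() returns a BreakIterator that locates legal line-wrapping positions.
- getSentenceInstance() returns a BreakIterator that locates sentence boundaries.
- getTitleInstance() is deprecated as of ICU 64; use getWordInstance() instead.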
BreakIterator's interface follows an "iterator" model (hence the name), meaning it has a concept of a "current position" and methods like first(), last(), next(), and previous() that update the current position. All BreakIterators uphold the following invariants:
Examples:

Creating and using text boundaries

```java
// Assumes: import com.ibm.icu.text.BreakIterator; import java.util.Locale;
public static void main(String[] args) {
    if (args.length == 1) {
        String stringToExamine = args[0];
        // print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(stringToExamine);
        printEachForward(boundary, stringToExamine);
        // print each sentence in reverse order
        boundary = BreakIterator.getSentenceInstance(Locale.US);
        boundary.setText(stringToExamine);
        printEachBackward(boundary, stringToExamine);
        printFirst(boundary, stringToExamine);
        printLast(boundary, stringToExamine);
    }
}
```

Print each element in order

```java
public static void printEachForward(BreakIterator boundary, String source) {
    int start = boundary.first();
    for (int end = boundary.next();
         end != BreakIterator.DONE;
         start = end, end = boundary.next()) {
        System.out.println(source.substring(start, end));
    }
}
```

Print each element in reverse order

```java
public static void printEachBackward(BreakIterator boundary, String source) {
    int end = boundary.last();
    for (int start = boundary.previous();
         start != BreakIterator.DONE;
         end = start, start = boundary.previous()) {
        System.out.println(source.substring(start, end));
    }
}
```

Print first element

```java
public static void printFirst(BreakIterator boundary, String source) {
    int start = boundary.first();
    int end = boundary.next();
    System.out.println(source.substring(start, end));
}
```

Print last element

```java
public static void printLast(BreakIterator boundary, String source) {
    int end = boundary.last();
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
```

Print the element at a specified position

```java
public static void printAt(BreakIterator boundary, int pos, String source) {
    int end = boundary.following(pos);
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
```

Find the next word

```java
public static int nextWordStartAfter(int pos, String text) {
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);
    int wordStart = wb.following(pos);
    for (;;) {
        int wordLimit = wb.next();
        if (wordLimit == BreakIterator.DONE) {
            return BreakIterator.DONE;
        }
        // The rule status describes the segment that ends at wordLimit.
        int wordStatus = wb.getRuleStatus();
        if (wordStatus != BreakIterator.WORD_NONE) {
            return wordStart;
        }
        wordStart = wordLimit;
    }
}
```

The iterator returned by getWordInstance() is unique in that the break positions it returns don't represent both the start and end of the thing being iterated over. That is, a sentence-break iterator returns breaks that each represent the end of one sentence and the beginning of the next. With the word-break iterator, the characters between two boundaries might be a word, or they might be the punctuation or whitespace between two words. The above code uses getRuleStatus() to identify and ignore boundaries associated with punctuation or other non-word characters.
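The same filtering idea can be sketched without the ICU4J jar. The example below is an illustration only: it uses the JDK's java.text.BreakIterator, which follows the same first()/next()/DONE protocol but has no getRuleStatus(), so a letter-or-digit check stands in for the WORD_NONE test. The class name WordsOnly and the method words are hypothetical names chosen for this sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordsOnly {
    /** Returns the segments between word boundaries that contain a letter or digit. */
    static List<String> words(String text) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.US);
        wb.setText(text);
        List<String> result = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            String segment = text.substring(start, end);
            // Assumption: stand-in for "getRuleStatus() != WORD_NONE" in ICU.
            // Segments made only of punctuation or whitespace are skipped.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));
    }
}
```

With ICU4J on the classpath, the import can be swapped for com.ibm.icu.text.BreakIterator and the letter-or-digit check replaced by the getRuleStatus() comparison shown in nextWordStartAfter.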
See Also: CharacterIterator
Modifier and Type | Field and Description |
---|---|
static int | DONE: DONE is returned by previous() and next() after all valid boundaries have been returned. |
static int | KIND_CHARACTER: [icu] |
static int | KIND_LINE: [icu] |
static int | KIND_SENTENCE: [icu] |
static int | KIND_TITLE: Deprecated. ICU 64 Use getWordInstance() instead. |
static int | KIND_WORD: [icu] |
static int | WORD_IDEO: Tag value for words containing ideographic characters, lower limit. |
static int | WORD_IDEO_LIMIT: Tag value for words containing ideographic characters, upper limit. |
static int | WORD_KANA: Tag value for words containing kana characters, lower limit. |
static int | WORD_KANA_LIMIT: Tag value for words containing kana characters, upper limit. |
static int | WORD_LETTER: Tag value for words that contain letters, excluding hiragana, katakana or ideographic characters, lower limit. |
static int | WORD_LETTER_LIMIT: Tag value for words containing letters, upper limit. |
static int | WORD_NONE: Tag value for "words" that do not fit into any other category. |
static int | WORD_NONE_LIMIT: Upper bound for tags for uncategorized words. |
static int | WORD_NUMBER: Tag value for words that appear to be numbers, lower limit. |
static int | WORD_NUMBER_LIMIT: Tag value for words that appear to be numbers, upper limit. |
Modifier | Constructor and Description |
---|---|
protected | BreakIterator(): Default constructor. |
Modifier and Type | Method and Description |
---|---|
Object | clone(): Clone method. |
abstract int | current(): Return the iterator's current position. |
abstract int | first(): Set the iterator to the first boundary position. |
abstract int | following(int offset): Sets the iterator's current iteration position to be the first boundary position following the specified position. |
static Locale[] | getAvailableLocales(): Returns a list of locales for which BreakIterators can be used. |
static ULocale[] | getAvailableULocales(): [icu] Returns a list of locales for which BreakIterators can be used. |
static BreakIterator | getBreakInstance(ULocale where, int kind): Deprecated. This API is ICU internal only. |
static BreakIterator | getCharacterInstance(): Returns a new instance of BreakIterator that locates logical-character boundaries. |
static BreakIterator | getCharacterInstance(Locale where): Returns a new instance of BreakIterator that locates logical-character boundaries. |
static BreakIterator | getCharacterInstance(ULocale where): [icu] Returns a new instance of BreakIterator that locates logical-character boundaries. |
static BreakIterator | getLineInstance(): Returns a new instance of BreakIterator that locates legal line-wrapping positions. |
static BreakIterator | getLineInstance(Locale where): Returns a new instance of BreakIterator that locates legal line-wrapping positions. |
static BreakIterator | getLineInstance(ULocale where): [icu] Returns a new instance of BreakIterator that locates legal line-wrapping positions. |
ULocale | getLocale(ULocale.Type type): [icu] Returns the locale that was used to create this object, or null. |
int | getRuleStatus(): For RuleBasedBreakIterators, return the status tag from the break rule that determined the boundary at the current iteration position. |
int | getRuleStatusVec(int[] fillInArray): For RuleBasedBreakIterators, get the status (tag) values from the break rule(s) that determined the boundary at the current iteration position. |
static BreakIterator | getSentenceInstance(): Returns a new instance of BreakIterator that locates sentence boundaries. |
static BreakIterator | getSentenceInstance(Locale where): Returns a new instance of BreakIterator that locates sentence boundaries. |
static BreakIterator | getSentenceInstance(ULocale where): [icu] Returns a new instance of BreakIterator that locates sentence boundaries. |
abstract CharacterIterator | getText(): Returns a CharacterIterator over the text being analyzed. |
static BreakIterator | getTitleInstance(): Deprecated. ICU 64 Use getWordInstance() instead. |
static BreakIterator | getTitleInstance(Locale where): Deprecated. ICU 64 Use getWordInstance() instead. |
static BreakIterator | getTitleInstance(ULocale where): Deprecated. ICU 64 Use getWordInstance() instead. |
static BreakIterator | getWordInstance(): Returns a new instance of BreakIterator that locates word boundaries. |
static BreakIterator | getWordInstance(Locale where): Returns a new instance of BreakIterator that locates word boundaries. |
static BreakIterator | getWordInstance(ULocale where): [icu] Returns a new instance of BreakIterator that locates word boundaries. |
boolean | isBoundary(int offset): Return true if the specified position is a boundary position. |
abstract int | last(): Set the iterator to the last boundary position. |
abstract int | next(): Advances the iterator forward one boundary. |
abstract int | next(int n): Move the iterator by the specified number of steps in the text. |
int | preceding(int offset): Sets the iterator's current iteration position to be the last boundary position preceding the specified position. |
abstract int | previous(): Move the iterator backward one boundary. |
static Object | registerInstance(BreakIterator iter, Locale locale, int kind): [icu] Registers a new break iterator of the indicated kind, to use in the given locale. |
static Object | registerInstance(BreakIterator iter, ULocale locale, int kind): [icu] Registers a new break iterator of the indicated kind, to use in the given locale. |
abstract void | setText(CharacterIterator newText): Sets the iterator to analyze a new piece of text. |
void | setText(CharSequence newText): Sets the iterator to analyze a new piece of text. |
void | setText(String newText): Sets the iterator to analyze a new piece of text. |
static boolean | unregister(Object key): [icu] Unregisters a previously-registered BreakIterator using the key returned from the register call. |
public static final int DONE
public static final int WORD_NONE
public static final int WORD_NONE_LIMIT
public static final int WORD_NUMBER
public static final int WORD_NUMBER_LIMIT
public static final int WORD_LETTER
public static final int WORD_LETTER_LIMIT
public static final int WORD_KANA
public static final int WORD_KANA_LIMIT
public static final int WORD_IDEO
public static final int WORD_IDEO_LIMIT
public static final int KIND_CHARACTER
public static final int KIND_WORD
public static final int KIND_LINE
public static final int KIND_SENTENCE
@Deprecated public static final int KIND_TITLE
Deprecated. ICU 64 Use getWordInstance() instead.
See Also: getTitleInstance(), getWordInstance()
protected BreakIterator()
public Object clone()
public abstract int first()
public abstract int last()
public abstract int next(int n)
n - The number of boundaries to advance over (if positive, moves forward; if negative, moves backwards).
public abstract int next()
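As a rough illustration of how next(int n) and next() relate, the sketch below steps a character-boundary iterator by signed counts and records where it lands. It is an assumption-laden example, not ICU's own sample: it uses the JDK's java.text.BreakIterator, which exposes the same next(int) contract, so it runs without the ICU4J jar. NextStepsDemo and positionsAfter are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Arrays;

final class NextStepsDemo {
    /** Starts at the first boundary and applies each signed step with next(int). */
    static int[] positionsAfter(String text, int... steps) {
        BreakIterator cb = BreakIterator.getCharacterInstance();
        cb.setText(text);
        cb.first(); // position 0, the beginning of the text
        int[] positions = new int[steps.length];
        for (int i = 0; i < steps.length; i++) {
            // Positive counts move forward, negative counts move backward.
            positions[i] = cb.next(steps[i]);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Forward three boundaries, back two, then forward one.
        System.out.println(Arrays.toString(positionsAfter("abcdef", 3, -2, 1)));
    }
}
```

Asking for more steps than there are boundaries left yields BreakIterator.DONE, which is why loops over next() compare against DONE instead of the text length.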
public abstract int previous()
public abstract int following(int offset)
offset - The character position to start searching from.
public int preceding(int offset)
offset - The character position to start searching from.
public boolean isBoundary(int offset)
offset - the offset to check.
public abstract int current()
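The random-access methods following(int), preceding(int), and isBoundary(int), together with current(), can be exercised as in the sketch below. This is a minimal example under the assumption that the JDK's java.text.BreakIterator behaves like the ICU class for these calls; RandomAccessDemo, around, and isWordBoundary are hypothetical names introduced for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class RandomAccessDemo {
    /** Returns {boundary before offset, boundary after offset, current() right after following()}. */
    static int[] around(String text, int offset) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.US);
        wb.setText(text);
        int after = wb.following(offset);  // first boundary after offset
        int current = wb.current();        // the iterator now rests on that boundary
        int before = wb.preceding(offset); // last boundary before offset
        return new int[] {before, after, current};
    }

    /** Reports whether offset is a word boundary in text. */
    static boolean isWordBoundary(String text, int offset) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.US);
        wb.setText(text);
        return wb.isBoundary(offset);
    }

    public static void main(String[] args) {
        String text = "one two";
        int[] r = around(text, 1);
        System.out.println(text.substring(r[0], r[1])); // the segment containing offset 1
        System.out.println(isWordBoundary(text, 4));
    }
}
```

Note that an offset that is itself a boundary is not returned by following() or preceding(); both look strictly past it.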
public int getRuleStatus()
For break iterator types that do not support a rule status, a default value of 0 is returned.
public int getRuleStatusVec(int[] fillInArray)
For break iterator types that do not support rule status, no values are returned.
If the size of the output array is insufficient to hold the data, the output will be truncated to the available length. No exception will be thrown.
fillInArray - an array to be filled in with the status values.
public abstract CharacterIterator getText()
Caution: The state of the returned CharacterIterator must not be modified in any way while the BreakIterator is still in use. Doing so will lead to undefined behavior of the BreakIterator. Clone the returned CharacterIterator first and work with that.
The returned CharacterIterator is a reference to the actual iterator being used by the BreakIterator. No guarantees are made about the current position of this iterator when it is returned; it may differ from the BreakIterator's current position. If you need to move that position to examine the text, clone this function's return value first.
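The clone-first advice can be sketched as follows. This example assumes the JDK's java.text.BreakIterator, whose getText() also hands back the live CharacterIterator; the class name GetTextCloneDemo and the method readText are hypothetical.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

final class GetTextCloneDemo {
    /** Reads the analyzed text through a clone, leaving the BreakIterator's own iterator alone. */
    static String readText(BreakIterator bi) {
        // Clone before moving: the object returned by getText() is the one the BreakIterator uses.
        CharacterIterator copy = (CharacterIterator) bi.getText().clone();
        StringBuilder sb = new StringBuilder();
        for (char c = copy.first(); c != CharacterIterator.DONE; c = copy.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.US);
        wb.setText("ab cd");
        wb.first();
        int position = wb.next();
        String text = readText(wb);
        // Walking the clone does not disturb the BreakIterator's position.
        System.out.println(text + " / unchanged position: " + (wb.current() == position));
    }
}
```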
public void setText(String newText)
newText - A String containing the text to analyze with this BreakIterator.
public void setText(CharSequence newText)
The text underlying the CharSequence must not be modified while the BreakIterator holds a reference to it (as could occur with a StringBuilder, for example).
newText - A CharSequence containing the text to analyze with this BreakIterator.
public abstract void setText(CharacterIterator newText)
Caution: The supplied CharacterIterator is used directly by the BreakIterator, and must not be altered in any way by code outside of the BreakIterator. Doing so will lead to undefined behavior of the BreakIterator.
newText - A CharacterIterator referring to the text to analyze with this BreakIterator (the iterator's current position is ignored, but its other state is significant).
public static BreakIterator getWordInstance()
public static BreakIterator getWordInstance(Locale where)
where - A locale specifying the language of the text to be analyzed.
NullPointerException - if where is null.
public static BreakIterator getWordInstance(ULocale where)
where - A locale specifying the language of the text to be analyzed.
NullPointerException - if where is null.
public static BreakIterator getLineInstance()
public static BreakIterator getLineInstance(Locale where)
where - A Locale specifying the language of the text being broken.
NullPointerException - if where is null.
public static BreakIterator getLineInstance(ULocale where)
where - A Locale specifying the language of the text being broken.
NullPointerException - if where is null.
public static BreakIterator getCharacterInstance()
public static BreakIterator getCharacterInstance(Locale where)
where - A Locale specifying the language of the text being analyzed.
NullPointerException - if where is null.
public static BreakIterator getCharacterInstance(ULocale where)
where - A Locale specifying the language of the text being analyzed.
NullPointerException - if where is null.
public static BreakIterator getSentenceInstance()
public static BreakIterator getSentenceInstance(Locale where)
where - A Locale specifying the language of the text being analyzed.
NullPointerException - if where is null.
public static BreakIterator getSentenceInstance(ULocale where)
where - A Locale specifying the language of the text being analyzed.
NullPointerException - if where is null.
@Deprecated public static BreakIterator getTitleInstance()
Deprecated. ICU 64 Use getWordInstance() instead.
See Also: getWordInstance()
@Deprecated public static BreakIterator getTitleInstance(Locale where)
Deprecated. ICU 64 Use getWordInstance() instead.
See Also: getWordInstance()
where - A Locale specifying the language of the text being analyzed.
NullPointerException - if where is null.
@Deprecated public static BreakIterator getTitleInstance(ULocale where)
Deprecated. ICU 64 Use getWordInstance() instead.
See Also: getWordInstance()
where - A Locale specifying the language of the text being analyzed.
NullPointerException - if where is null.
public static Object registerInstance(BreakIterator iter, Locale locale, int kind)
Because ICU may choose to cache BreakIterator objects internally, this must be called at application startup, prior to any calls to BreakIterator.getInstance to avoid undefined behavior.
iter - the BreakIterator instance to adopt.
locale - the Locale for which this instance is to be registered
kind - the type of iterator for which this instance is to be registered
public static Object registerInstance(BreakIterator iter, ULocale locale, int kind)
Because ICU may choose to cache BreakIterator objects internally, this must be called at application startup, prior to any calls to BreakIterator.getInstance to avoid undefined behavior.
iter - the BreakIterator instance to adopt.
locale - the Locale for which this instance is to be registered
kind - the type of iterator for which this instance is to be registered
public static boolean unregister(Object key)
key - the registry key returned by a previous call to registerInstance
@Deprecated public static BreakIterator getBreakInstance(ULocale where, int kind)
public static Locale[] getAvailableLocales()
public static ULocale[] getAvailableULocales()
public final ULocale getLocale(ULocale.Type type)
Note: The actual locale is returned correctly, but the valid locale is not, in most cases.
type - type of information requested, either ULocale.VALID_LOCALE or ULocale.ACTUAL_LOCALE.
See Also: ULocale, ULocale.VALID_LOCALE, ULocale.ACTUAL_LOCALE
Copyright © 2016 Unicode, Inc. and others.