public final class ICUTokenizer extends Tokenizer
Words are broken across script boundaries, then segmented according to
the BreakIterator and typing provided by the ICUTokenizerConfig.
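
The two-stage idea described above (split at script boundaries first, then segment each run with a BreakIterator) can be illustrated with a minimal sketch that uses only the JDK. This is not the ICU implementation: it assumes `java.text.BreakIterator` and `Character.UnicodeScript` as stand-ins for the ICU classes, the class name `ScriptAwareWordBreaker` is hypothetical, and the rule that COMMON and INHERITED code points stay in the current run is an assumption of this sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or ICU.
final class ScriptAwareWordBreaker {

    /** Stage 1: split text into runs that each contain a single Unicode script. */
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int runStart = 0;
        Character.UnicodeScript runScript = null; // unknown until a non-neutral code point is seen
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            // Assumption of this sketch: spaces, digits, punctuation (COMMON) and
            // combining marks (INHERITED) never start a new run.
            boolean neutral = script == Character.UnicodeScript.COMMON
                    || script == Character.UnicodeScript.INHERITED;
            if (!neutral) {
                if (runScript != null && script != runScript) {
                    runs.add(text.substring(runStart, i)); // script changed: close the run
                    runStart = i;
                }
                runScript = script;
            }
            i += Character.charCount(cp);
        }
        if (runStart < text.length()) {
            runs.add(text.substring(runStart));
        }
        return runs;
    }

    /** Stage 2: segment each script run with a word BreakIterator. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        for (String run : scriptRuns(text)) {
            words.setText(run);
            int start = words.first();
            for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
                String candidate = run.substring(start, end);
                // Keep only segments containing a letter or digit (drops spaces, punctuation).
                if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Latin text immediately followed by Cyrillic text, with no space between them.
        System.out.println(tokenize("hello\u043C\u0438\u0440 and more"));
    }
}
```

In this sketch the Latin and Cyrillic parts of `hello\u043C\u0438\u0440` end up in separate tokens only because of the script-run stage; the word iterator alone has no reason to break between two adjacent letters.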
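See Also: ICUTokenizerConfig
Nested classes inherited from class AttributeSource: AttributeSource.State
Fields inherited from class TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY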
Constructor and Description |
---|
ICUTokenizer(): Construct a new ICUTokenizer that breaks text into words from the input Reader (supplied through setReader). |
ICUTokenizer(AttributeFactory factory, ICUTokenizerConfig config): Construct a new ICUTokenizer that breaks text into words from the input Reader (supplied through setReader), using a tailored BreakIterator configuration. |
ICUTokenizer(ICUTokenizerConfig config): Construct a new ICUTokenizer that breaks text into words from the input Reader (supplied through setReader), using a tailored BreakIterator configuration. |
Modifier and Type | Method and Description |
---|---|
void | end() |
boolean | incrementToken() |
void | reset() |
Methods inherited from class Tokenizer: close, correctOffset, setReader
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
public ICUTokenizer()
The default script-specific handling is used.
The default attribute factory is used.
See Also: DefaultICUTokenizerConfig
public ICUTokenizer(ICUTokenizerConfig config)
The default attribute factory is used.
Parameters:
config - Tailored BreakIterator configuration
public ICUTokenizer(AttributeFactory factory, ICUTokenizerConfig config)
Parameters:
factory - AttributeFactory to use
config - Tailored BreakIterator configuration
public boolean incrementToken() throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException
public void reset() throws IOException
Overrides: reset in class Tokenizer
Throws: IOException
public void end() throws IOException
Overrides: end in class TokenStream
Throws: IOException
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.