public class DefaultICUTokenizerConfig extends ICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with tailorings: depending on the constructor arguments, CJK and Myanmar text can undergo dictionary-based segmentation.
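As a rough illustration of UAX#29-style word breaking, the sketch below uses the JDK's java.text.BreakIterator as a stand-in for ICU's BreakIterator.getWordInstance(ULocale.ROOT). The class name WordBreakSketch and the tokenize helper are hypothetical and not part of Lucene or ICU. ICU's rules and the tailorings of this class may place boundaries differently, especially for CJK, Thai, or Myanmar text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; it only approximates what an ICU word BreakIterator does.
class WordBreakSketch {

    /** Splits text at word boundaries and keeps only segments that contain a letter or digit. */
    static List<String> tokenize(String text) {
        // JDK word iterator for the root locale (assumption: used here instead of ICU's).
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Whitespace and punctuation segments are boundaries too, but they are not tokens.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world 42"));
    }
}
```

The filter on letters and digits plays the role that the rule status plays in a real tokenizer: segments that are not words or numbers are skipped rather than emitted.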
Modifier and Type | Field and Description |
---|---|
static String | WORD_EMOJI: Token type for words that appear to be emoji sequences |
static String | WORD_HANGUL: Token type for words containing Korean hangul |
static String | WORD_HIRAGANA: Token type for words containing Japanese hiragana |
static String | WORD_IDEO: Token type for words containing ideographic characters |
static String | WORD_KATAKANA: Token type for words containing Japanese katakana |
static String | WORD_LETTER: Token type for words that contain letters |
static String | WORD_NUMBER: Token type for words that appear to be numbers |
Fields inherited from class ICUTokenizerConfig: EMOJI_SEQUENCE_STATUS
Constructor and Description |
---|
DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords): Creates a new config. |
Modifier and Type | Method and Description |
---|---|
boolean | combineCJ(): true if Han, Hiragana, and Katakana scripts should all be returned as Japanese |
com.ibm.icu.text.RuleBasedBreakIterator | getBreakIterator(int script): Return a break iterator capable of processing a given script. |
String | getType(int script, int ruleStatus): Return a token type value for a given script and BreakIterator rule status. |
public static final String WORD_IDEO
public static final String WORD_HIRAGANA
public static final String WORD_KATAKANA
public static final String WORD_HANGUL
public static final String WORD_LETTER
public static final String WORD_NUMBER
public static final String WORD_EMOJI
public DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)
Creates a new config.
Parameters:
cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise the text is segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise it is tokenized as syllables.
public boolean combineCJ()
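Description copied from class: ICUTokenizerConfig
true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
Specified by: combineCJ in class ICUTokenizerConfig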
public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
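Description copied from class: ICUTokenizerConfig
Return a break iterator capable of processing a given script.
Specified by: getBreakIterator in class ICUTokenizerConfig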
public String getType(int script, int ruleStatus)
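Description copied from class: ICUTokenizerConfig
Return a token type value for a given script and BreakIterator rule status.
Specified by: getType in class ICUTokenizerConfig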
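To make the idea of deriving a token type from a script plus a rule status more concrete, here is a minimal standalone sketch. It is an assumption about how such a mapping could be organized, not the library's implementation: the RuleStatus enum, the typeFor method, and the returned labels (which simply reuse the field names listed above) are hypothetical, and the JDK's Character.UnicodeScript stands in for ICU's integer script codes.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical sketch; names and return values are illustrative, not Lucene's or ICU's.
class TokenTypeSketch {

    /** Simplified stand-in for the integer rule status reported by a break iterator. */
    enum RuleStatus { IDEO, KANA, LETTER, NUMBER, EMOJI, NONE }

    /** Chooses a token type label from the script of the text and the rule status of the break. */
    static String typeFor(UnicodeScript script, RuleStatus status) {
        switch (status) {
            case IDEO:
                return "WORD_IDEO";
            case KANA:
                // The status alone does not separate the two kana scripts, so the script decides.
                return script == UnicodeScript.HIRAGANA ? "WORD_HIRAGANA" : "WORD_KATAKANA";
            case LETTER:
                // Hangul words get their own type; all other letter words share one.
                return script == UnicodeScript.HANGUL ? "WORD_HANGUL" : "WORD_LETTER";
            case NUMBER:
                return "WORD_NUMBER";
            case EMOJI:
                return "WORD_EMOJI";
            default:
                return "OTHER"; // placeholder label for anything that is not a word
        }
    }

    public static void main(String[] args) {
        System.out.println(typeFor(UnicodeScript.HANGUL, RuleStatus.LETTER));
        System.out.println(typeFor(UnicodeScript.LATIN, RuleStatus.NUMBER));
    }
}
```

The point of the two-argument signature is that neither input is sufficient alone: the rule status says what kind of token was found, while the script refines the type where one status covers several scripts.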
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.