DefaultICUTokenizerConfig (Lucene 8.9.0 API)

java.lang.Object
- org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
- - org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig

```
public class DefaultICUTokenizerConfig
extends ICUTokenizerConfig
```
Default ICUTokenizerConfig that is generally applicable to many languages.
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
WARNING: This API is experimental and might change in incompatible ways in the next release.
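To make the UAX#29 baseline concrete, here is a minimal sketch of word-boundary segmentation. It deliberately uses the JDK's `java.text.BreakIterator` as a stand-in rather than ICU4J's `BreakIterator`, so it runs without extra dependencies; the class name `WordBreakSketch`, the helper `words`, and the letter-or-digit filter are assumptions made for illustration and are not part of Lucene. It does not reproduce the dictionary-based tailorings listed above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    /**
     * Splits text at word boundaries and keeps only the segments that
     * contain at least one letter or digit (dropping spaces and punctuation).
     */
    static List<String> words(String text) {
        // JDK stand-in for ICU's BreakIterator.getWordInstance(ULocale.ROOT).
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, wide world")); // prints [Hello, wide, world]
    }
}
```

The real tokenizer decides which segments to keep from the break iterator's rule status rather than from a character test; the filter here is only a simple approximation of that step.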

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`WORD_EMOJI` Token type for words that appear to be emoji sequences
`static String`	`WORD_HANGUL` Token type for words containing Korean hangul
`static String`	`WORD_HIRAGANA` Token type for words containing Japanese hiragana
`static String`	`WORD_IDEO` Token type for words containing ideographic characters
`static String`	`WORD_KATAKANA` Token type for words containing Japanese katakana
`static String`	`WORD_LETTER` Token type for words that contain letters
`static String`	`WORD_NUMBER` Token type for words that appear to be numbers

Fields inherited from class org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
EMOJI_SEQUENCE_STATUS

Constructor Summary

Constructors
Constructor and Description
`DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)` Creates a new config.

Method Summary

Methods
Modifier and Type	Method and Description
`boolean`	`combineCJ()` Returns true if Han, Hiragana, and Katakana scripts should all be returned as Japanese.
`com.ibm.icu.text.RuleBasedBreakIterator`	`getBreakIterator(int script)` Return a breakiterator capable of processing a given script.
`String`	`getType(int script, int ruleStatus)` Return a token type value for a given script and BreakIterator rule status.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - WORD_IDEO
```
public static final String WORD_IDEO
```
    Token type for words containing ideographic characters
  - WORD_HIRAGANA
```
public static final String WORD_HIRAGANA
```
    Token type for words containing Japanese hiragana
  - WORD_KATAKANA
```
public static final String WORD_KATAKANA
```
    Token type for words containing Japanese katakana
  - WORD_HANGUL
```
public static final String WORD_HANGUL
```
    Token type for words containing Korean hangul
  - WORD_LETTER
```
public static final String WORD_LETTER
```
    Token type for words that contain letters
  - WORD_NUMBER
```
public static final String WORD_NUMBER
```
    Token type for words that appear to be numbers
  - WORD_EMOJI
```
public static final String WORD_EMOJI
```
    Token type for words that appear to be emoji sequences
- Constructor Detail
  - DefaultICUTokenizerConfig
```
public DefaultICUTokenizerConfig(boolean cjkAsWords,
                                 boolean myanmarAsWords)
```
    Creates a new config. This object is lightweight, but the break iterators are initialized the first time the class is referenced.
    
    Parameters:
    
    cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise it will be segmented according to UAX#29 defaults. If this is true, all Han, Hiragana, and Katakana words will be tagged as IDEOGRAPHIC.
    
    myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise it will be tokenized as syllables.
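    The following is a minimal usage sketch, assuming the `lucene-analyzers-icu` module and ICU4J are on the classpath. It passes a config to `ICUTokenizer` and collects each term with its token type. The class name `IcuConfigSketch` and the helper `tokenize` are hypothetical names introduced for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class IcuConfigSketch {

    /** Tokenizes text and returns one "term type" string per token. */
    static List<String> tokenize(String text, boolean cjkAsWords, boolean myanmarAsWords) {
        List<String> out = new ArrayList<>();
        try (Tokenizer tokenizer =
                 new ICUTokenizer(new DefaultICUTokenizerConfig(cjkAsWords, myanmarAsWords))) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

            // Standard TokenStream workflow: setReader, reset, incrementToken, end, close.
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                out.add(term.toString() + " " + type.type());
            }
            tokenizer.end();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Compare dictionary-based CJK segmentation with the UAX#29 defaults.
        System.out.println(tokenize("hello 我购买了道具", true, true));
        System.out.println(tokenize("hello 我购买了道具", false, true));
    }
}
```

    Running the two calls side by side is one way to see how `cjkAsWords` changes both the token boundaries and the reported types for the same input.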
- Method Detail
  - combineCJ
```
public boolean combineCJ()
```
    Description copied from class: ICUTokenizerConfig
    
    true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
    
    Specified by:
    
    combineCJ in class ICUTokenizerConfig
  - getBreakIterator
```
public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
```
    Description copied from class: ICUTokenizerConfig
    
    Return a breakiterator capable of processing a given script.
    
    Specified by:
    
    getBreakIterator in class ICUTokenizerConfig
  - getType
```
public String getType(int script,
                      int ruleStatus)
```
    Description copied from class: ICUTokenizerConfig
    
    Return a token type value for a given script and BreakIterator rule status.
    
    Specified by:
    
    getType in class ICUTokenizerConfig
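    As an illustration of the kind of mapping `getType` performs, the sketch below pairs a script with a rule status and returns the name of the matching token-type field. It is not Lucene's implementation: the enums `Script` and `Status`, the method `typeFor`, and the `"OTHER"` fallback are hypothetical stand-ins for the `int` script codes and rule statuses the real method receives, and the returned strings are the field names from this page rather than the fields' actual values.

```java
public class TokenTypeSketch {

    /** Hypothetical stand-ins for the int script codes passed to getType. */
    enum Script { HAN, HIRAGANA, KATAKANA, HANGUL, LATIN }

    /** Hypothetical stand-ins for the int BreakIterator rule statuses. */
    enum Status { IDEO, KANA, LETTER, NUMBER, EMOJI, NONE }

    /** Returns the name of the token-type field that would apply. */
    static String typeFor(Script script, Status status) {
        switch (status) {
            case IDEO:
                return "WORD_IDEO";
            case KANA:
                // Kana words are split by script into hiragana and katakana.
                return script == Script.HIRAGANA ? "WORD_HIRAGANA" : "WORD_KATAKANA";
            case LETTER:
                // Hangul gets its own type; other letter words share one.
                return script == Script.HANGUL ? "WORD_HANGUL" : "WORD_LETTER";
            case NUMBER:
                return "WORD_NUMBER";
            case EMOJI:
                return "WORD_EMOJI";
            default:
                return "OTHER"; // assumed fallback for this sketch only
        }
    }

    public static void main(String[] args) {
        System.out.println(typeFor(Script.HANGUL, Status.LETTER));  // prints WORD_HANGUL
        System.out.println(typeFor(Script.LATIN, Status.NUMBER));   // prints WORD_NUMBER
    }
}
```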
