public class JapaneseTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware
Factory for JapaneseTokenizer.
<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory"
      mode="NORMAL"
      userDictionary="user.txt"
      userDictionaryEncoding="UTF-8"
      discardPunctuation="true"
      discardCompoundToken="false"
    />
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
  </analyzer>
</fieldType>
The expert parameters nBestCost and nBestExamples can be used to include searchable tokens in addition to those most likely according to the statistical model. A typical use case is to improve recall and make segmentation more resilient to mistakes. The feature can also be used to get a decompounding effect.
The nBestCost parameter specifies an additional Viterbi cost, and when used, JapaneseTokenizer will include all tokens in Viterbi paths that are within the nBestCost value of the best path.
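The selection rule described above can be illustrated with a toy model. This is a minimal sketch, not Lucene's Viterbi implementation: the class `NBestSketch`, the `Path` type, and the method `tokensWithin` are hypothetical names introduced only for illustration, and it assumes the candidate paths and their costs are already known.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of n-best token selection; hypothetical, not part of Lucene. */
class NBestSketch {

  /** A candidate segmentation of the input together with its total path cost. */
  static final class Path {
    final int cost;
    final List<String> tokens;

    Path(int cost, String... tokens) {
      this.cost = cost;
      this.tokens = Arrays.asList(tokens);
    }
  }

  /** Collects the tokens of every path whose cost is within nBestCost of the cheapest path. */
  static Set<String> tokensWithin(List<Path> paths, int nBestCost) {
    // Find the cost of the best (cheapest) path.
    int best = Integer.MAX_VALUE;
    for (Path p : paths) {
      best = Math.min(best, p.cost);
    }
    // Keep tokens from all paths that are close enough to the best one.
    Set<String> result = new LinkedHashSet<>();
    for (Path p : paths) {
      if (p.cost - best <= nBestCost) {
        result.addAll(p.tokens);
      }
    }
    return result;
  }
}
```

With invented costs of 1000 for the single-token path [成田空港] and 1800 for the path [成田, 空港], an nBestCost of 0 keeps only 成田空港, while any value of 800 or more also adds 成田 and 空港.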
Finding a good value for nBestCost can be difficult to do by hand. The nBestExamples parameter can be used to find an nBestCost value based on examples with desired segmentation outcomes.
For example, a value of /箱根山-箱根/成田空港-成田/ indicates that for the texts 箱根山 (Mt. Hakone) and 成田空港 (Narita Airport) we'd like a cost that gives us 箱根 (Hakone) and 成田 (Narita). Note that costs are estimated for each example individually, and the maximum nBestCost found across all examples is used.
If both nBestCost and nBestExamples are used in a configuration, the larger of the two values is used.
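The example format and the two "maximum" rules can be sketched as follows. This is an illustration under stated assumptions rather than the factory's actual code: `NBestExamplesSketch`, `costFromExamples`, and `effectiveCost` are hypothetical names, and the `estimator` argument stands in for whatever logic estimates the cost needed to obtain the wanted token from a given text.

```java
import java.util.function.ToIntBiFunction;

/** Sketch of combining nBestExamples with nBestCost; hypothetical, not part of Lucene. */
class NBestExamplesSketch {

  /**
   * Parses a value such as "/text-wanted/text-wanted/" and returns the largest
   * cost that the estimator reports for any (text, wanted token) pair.
   */
  static int costFromExamples(String examples, ToIntBiFunction<String, String> estimator) {
    int max = 0;
    for (String example : examples.split("/")) {
      if (example.isEmpty()) {
        continue; // the leading slash produces an empty first element
      }
      String[] pair = example.split("-");
      if (pair.length != 2) {
        throw new IllegalArgumentException("Expected text-wantedToken but got: " + example);
      }
      // Each example is estimated on its own; only the maximum is kept.
      max = Math.max(max, estimator.applyAsInt(pair[0], pair[1]));
    }
    return max;
  }

  /** When both parameters are configured, the larger value wins. */
  static int effectiveCost(int nBestCost, String examples,
      ToIntBiFunction<String, String> estimator) {
    if (examples == null) {
      return nBestCost;
    }
    return Math.max(nBestCost, costFromExamples(examples, estimator));
  }
}
```

Passing the estimator as a function keeps the sketch self-contained; in a real tokenizer the estimate would come from the dictionary and its cost model.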
The nBestCost and nBestExamples parameters work with all tokenizer modes, but they make the most sense in NORMAL mode.
Modifier and Type | Field and Description |
---|---|
static String | NAME: SPI name |
Fields inherited from class AbstractAnalysisFactory: LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
Constructor and Description |
---|
JapaneseTokenizerFactory(Map<String,String> args): Creates a new JapaneseTokenizerFactory |
Modifier and Type | Method and Description |
---|---|
JapaneseTokenizer | create(AttributeFactory factory) |
void | inform(ResourceLoader loader) |
Methods inherited from class TokenizerFactory: availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
Methods inherited from class AbstractAnalysisFactory: get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
public static final String NAME
public void inform(ResourceLoader loader) throws IOException
Specified by: inform in interface ResourceLoaderAware
Throws: IOException
public JapaneseTokenizer create(AttributeFactory factory)
Specified by: create in class TokenizerFactory
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.