public class JapaneseNumberFilter extends TokenFilter
A TokenFilter that normalizes Japanese numbers (kansūji) to regular Arabic decimal numbers in half-width characters.
Japanese numbers are often written using a combination of kanji and Arabic numerals with various kinds of punctuation. For example, 3.2千 means 3200. This filter performs this kind of normalization, which allows a search for 3200 to match 3.2千 in text. The normalized numbers can also be used for other purposes, such as building range facets.
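As a rough illustration of the arithmetic behind 3.2千 → 3200, the following sketch scales a decimal by a trailing kanji unit (十, 百, 千, 万 or 億). It is not the filter's implementation: the class `ScaledNumberSketch` and its methods are hypothetical names, and the sketch assumes a non-empty input consisting of a plain decimal optionally followed by a single unit character. Kanji are written as Unicode escapes so the source stays ASCII-only.

```java
import java.math.BigDecimal;

// Illustrative sketch only; ScaledNumberSketch is a hypothetical class, not part of Lucene.
final class ScaledNumberSketch {

    // Maps a kanji unit character to its multiplier, or returns null if it is not a unit.
    static BigDecimal multiplierFor(char c) {
        switch (c) {
            case '\u5341': return BigDecimal.TEN;                   // juu (10)
            case '\u767E': return BigDecimal.valueOf(100);          // hyaku (100)
            case '\u5343': return BigDecimal.valueOf(1_000);        // sen (1,000)
            case '\u4E07': return BigDecimal.valueOf(10_000);       // man (10,000)
            case '\u5104': return BigDecimal.valueOf(100_000_000L); // oku (100,000,000)
            default: return null;
        }
    }

    // Assumes non-empty input of the form <decimal>[<unit>].
    static String normalize(String text) {
        char last = text.charAt(text.length() - 1);
        BigDecimal multiplier = multiplierFor(last);
        String digits = (multiplier == null) ? text : text.substring(0, text.length() - 1);
        // Treat the full-width full stop (U+FF0E) as a decimal point.
        digits = digits.replace('\uFF0E', '.');
        // A bare unit with no leading digits counts as one of that unit.
        BigDecimal value = digits.isEmpty() ? BigDecimal.ONE : new BigDecimal(digits);
        if (multiplier != null) {
            value = value.multiply(multiplier);
        }
        // Drop trailing zeros and avoid scientific notation in the result.
        return value.stripTrailingZeros().toPlainString();
    }
}
```

Calling `ScaledNumberSketch.normalize("3.2\u5343")` returns the string `3200`. Using `BigDecimal` keeps the decimal fraction exact before scaling.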
Notice that this filter uses a token composition scheme and relies on punctuation tokens being found in the token stream. Please make sure your JapaneseTokenizer has discardPunctuation set to false. If punctuation characters such as ． (U+FF0E FULLWIDTH FULL STOP) are removed from the token stream, this filter would find input tokens 3 and 2千 and give outputs 3 and 2000 instead of 3200, which is likely not the intended result. If you want to remove punctuation characters from your index that are not part of normalized numbers, add a StopFilter with the punctuation you wish to remove after JapaneseNumberFilter in your analyzer chain.
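The chain described above could be wired up roughly as follows. This is a configuration sketch, not a recommended setup: the analyzer name `NumberAwareJapaneseAnalyzer` is hypothetical, the punctuation set is only an example, and the package names assume a Lucene release in which `StopFilter` and `CharArraySet` live in `org.apache.lucene.analysis`. It needs the Lucene core and kuromoji modules on the classpath.

```java
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseNumberFilter;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

// Hypothetical analyzer; the name and the punctuation set are assumptions for illustration.
final class NumberAwareJapaneseAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // No user dictionary; discardPunctuation = false keeps punctuation tokens
        // in the stream so the number filter can compose tokens across them.
        Tokenizer tokenizer = new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH);
        TokenStream stream = new JapaneseNumberFilter(tokenizer);
        // Remove leftover punctuation only after numbers have been normalized.
        CharArraySet punctuation =
            new CharArraySet(Arrays.asList("、", "。", "．", ","), false);
        stream = new StopFilter(stream, punctuation);
        return new TokenStreamComponents(tokenizer, stream);
    }
}
```

The ordering is the point of the sketch: the StopFilter sits after JapaneseNumberFilter, so punctuation is still present when numbers are composed.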
Below are some examples of normalizations this filter supports. The input is untokenized text and the result is the single term attribute emitted for the input.
Tokens preceded by a token with PositionIncrementAttribute of zero are left untouched and emitted as-is.
This filter does not use any part-of-speech information for its normalization and the motivation for this is to also support n-grammed token streams in the future.
This filter may in some cases normalize tokens that are not numbers in their context. For example, 田中京一 is a name and means Tanaka Kyōichi, but 京一 (Kyōichi) out of context can, strictly speaking, also represent the number 10000000000000001. This filter respects the KeywordAttribute, which can be used to prevent specific normalizations from happening.
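The value quoted for 京一 follows from simple arithmetic: 京 denotes 10^16 and 一 denotes 1, so the pair reads as 10^16 + 1. The sketch below works this out with `BigInteger` and also checks whether the value survives a round trip through `double`. The class `KeiIchiSketch` is a hypothetical name. That limited floating-point precision motivates the `BigDecimal` return types in this class is an assumption, not something this documentation states.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Illustrative arithmetic only; KeiIchiSketch is a hypothetical class, not part of Lucene.
final class KeiIchiSketch {

    // kei (U+4EAC) denotes 10^16.
    static final BigInteger KEI = BigInteger.TEN.pow(16);

    // kei followed by ichi (U+4E00) reads as 10^16 + 1.
    static BigInteger keiIchi() {
        return KEI.add(BigInteger.ONE);
    }

    // Returns true only if the value is unchanged after conversion to double and back.
    static boolean exactAsDouble() {
        double approx = keiIchi().doubleValue();
        return new BigDecimal(approx).toBigInteger().equals(keiIchi());
    }
}
```

Here `keiIchi()` yields 10000000000000001, while `exactAsDouble()` is false because odd integers above 2^53 have no exact `double` representation.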
Also notice that token attributes such as PartOfSpeechAttribute, ReadingAttribute, InflectionAttribute and BaseFormAttribute are left unchanged. They inherit the values of the last token used to compose the normalized number and can therefore be wrong. Hence, for 10万 (100000), ReadingAttribute will be set to マン. This is a known issue and is subject to a future improvement.
Japanese formal numbers (daiji), accounting numbers and decimal fractions are currently not supported.
Modifier and Type | Class and Description |
---|---|
static class | JapaneseNumberFilter.NumberBuffer: Buffer that holds a Japanese number string and a position index used as a parsed-to marker |
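Nested classes/interfaces inherited from class AttributeSource: AttributeSource.State
Fields inherited from class TokenFilter: input
Fields inherited from class TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY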
Constructor and Description |
---|
JapaneseNumberFilter(TokenStream input) |
Modifier and Type | Method and Description |
---|---|
boolean | incrementToken() |
boolean | isArabicNumeral(char c): Arabic numeral predicate |
boolean | isNumeral(char c): Numeral predicate |
boolean | isNumeral(String input): Numeral predicate |
boolean | isNumeralPunctuation(char c): Numeral punctuation predicate |
boolean | isNumeralPunctuation(String input): Numeral punctuation predicate |
String | normalizeNumber(String number): Normalizes a Japanese number |
BigDecimal | parseLargeKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer): Parse large kanji numerals (ten thousands or larger) |
BigDecimal | parseMediumKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer): Parse medium kanji numerals (tens, hundreds or thousands) |
void | reset() |
Methods inherited from class TokenFilter: close, end
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
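The method summary distinguishes medium kanji numerals (tens, hundreds, thousands) from large ones (ten thousands or larger). A minimal sketch of how such a two-level parse can work is shown below. It is not the code behind parseMediumKanjiNumeral or parseLargeKanjiNumeral: the class `KanjiNumeralSketch` and its methods are hypothetical, only the units 十, 百, 千, 万, 億 and 兆 are handled, and punctuation and decimal fractions are ignored. Kanji appear as Unicode escapes so the source stays ASCII-only.

```java
import java.math.BigDecimal;

// Illustrative sketch only; KanjiNumeralSketch is a hypothetical class, not Lucene code.
final class KanjiNumeralSketch {

    // Value of a kanji or ASCII digit, or -1 if c is not a digit.
    static int digitValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        // The index in this string is the digit value: maru (0), ichi (1), ..., kyuu (9).
        return "\u3007\u4E00\u4E8C\u4E09\u56DB\u4E94\u516D\u4E03\u516B\u4E5D".indexOf(c);
    }

    // Medium units: juu (10), hyaku (100), sen (1,000).
    static BigDecimal mediumUnit(char c) {
        switch (c) {
            case '\u5341': return BigDecimal.valueOf(10);
            case '\u767E': return BigDecimal.valueOf(100);
            case '\u5343': return BigDecimal.valueOf(1_000);
            default: return null;
        }
    }

    // Large units: man (10^4), oku (10^8), chou (10^12).
    static BigDecimal largeUnit(char c) {
        switch (c) {
            case '\u4E07': return BigDecimal.valueOf(10_000L);
            case '\u5104': return BigDecimal.valueOf(100_000_000L);
            case '\u5146': return BigDecimal.valueOf(1_000_000_000_000L);
            default: return null;
        }
    }

    static BigDecimal parse(String text) {
        BigDecimal total = BigDecimal.ZERO;  // sum of completed large groups
        BigDecimal group = BigDecimal.ZERO;  // value accumulated since the last large unit
        BigDecimal digits = null;            // pending digit run, null if none
        boolean groupSeen = false;           // whether the current group has any content
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int d = digitValue(c);
            BigDecimal medium = mediumUnit(c);
            BigDecimal large = largeUnit(c);
            if (d >= 0) {
                BigDecimal base = (digits == null) ? BigDecimal.ZERO : digits;
                digits = base.multiply(BigDecimal.TEN).add(BigDecimal.valueOf(d));
                groupSeen = true;
            } else if (medium != null) {
                // A bare medium unit counts as one of that unit.
                BigDecimal count = (digits == null) ? BigDecimal.ONE : digits;
                group = group.add(count.multiply(medium));
                digits = null;
                groupSeen = true;
            } else if (large != null) {
                if (digits != null) group = group.add(digits);
                // A bare large unit counts as one of that unit.
                if (!groupSeen) group = BigDecimal.ONE;
                total = total.add(group.multiply(large));
                group = BigDecimal.ZERO;
                digits = null;
                groupSeen = false;
            } else {
                throw new IllegalArgumentException("Unsupported character at index " + i);
            }
        }
        if (digits != null) group = group.add(digits);
        return total.add(group);
    }
}
```

With this sketch, 三千二百二十三 parses to 3223: medium units accumulate within a group, and each large unit multiplies the group collected so far and then starts a new one.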
public JapaneseNumberFilter(TokenStream input)
public final boolean incrementToken() throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException
public void reset() throws IOException
Overrides: reset in class TokenFilter
Throws: IOException
public String normalizeNumber(String number)
Parameters: number - number to normalize
public BigDecimal parseLargeKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer)
Parameters: buffer - buffer to parse
public BigDecimal parseMediumKanjiNumeral(JapaneseNumberFilter.NumberBuffer buffer)
Parameters: buffer - buffer to parse
public boolean isNumeral(String input)
Parameters: input - string to test
public boolean isNumeral(char c)
Parameters: c - character to test
public boolean isNumeralPunctuation(String input)
Parameters: input - string to test
public boolean isNumeralPunctuation(char c)
Parameters: c - character to test
public boolean isArabicNumeral(char c)
Parameters: c - character to test
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.