public class HyphenationCompoundWordTokenFilter extends CompoundWordTokenFilterBase
A TokenFilter that decomposes compound words found in many Germanic languages.
"Donaudampfschiff" becomes Donau, dampf, schiff so that you can find
"Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation
grammar and a word dictionary to achieve this.

Nested classes inherited from CompoundWordTokenFilterBase: CompoundWordTokenFilterBase.CompoundToken
Nested classes inherited from AttributeSource: AttributeSource.State
Fields inherited from CompoundWordTokenFilterBase: DEFAULT_MAX_SUBWORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MIN_WORD_SIZE, dictionary, maxSubwordSize, minSubwordSize, minWordSize, offsetAtt, onlyLongestMatch, termAtt, tokens
Fields inherited from TokenFilter: input
Fields inherited from TokenStream: DEFAULT_TOKEN_ATTRIBUTE_FACTORY
Constructor and Description |
---|
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator) Create a HyphenationCompoundWordTokenFilter with no dictionary. |
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary) Creates a new HyphenationCompoundWordTokenFilter instance. |
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch) Creates a new HyphenationCompoundWordTokenFilter instance. |
HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize) Create a HyphenationCompoundWordTokenFilter with no dictionary. |
Modifier and Type | Method and Description |
---|---|
protected void | decompose() Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list. |
static HyphenationTree | getHyphenationTree(InputSource hyphenationSource) Create a hyphenator tree |
static HyphenationTree | getHyphenationTree(String hyphenationFilename) Create a hyphenator tree |
Methods inherited from CompoundWordTokenFilterBase: incrementToken, reset
Methods inherited from TokenFilter: close, end
Methods inherited from AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary)
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against.

public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, CharArraySet dictionary, int minWordSize, int minSubwordSize, int maxSubwordSize, boolean onlyLongestMatch)
Creates a new HyphenationCompoundWordTokenFilter instance.
Parameters:
input - the TokenStream to process
hyphenator - the hyphenation pattern tree to use for hyphenation
dictionary - the word dictionary to match against.
minWordSize - only words longer than this get processed
minSubwordSize - only subwords longer than this get to the output stream
maxSubwordSize - only subwords shorter than this get to the output stream
onlyLongestMatch - Add only the longest matching subword to the stream

public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator, int minWordSize, int minSubwordSize, int maxSubwordSize)
Create a HyphenationCompoundWordTokenFilter with no dictionary.

public HyphenationCompoundWordTokenFilter(TokenStream input, HyphenationTree hyphenator)
Create a HyphenationCompoundWordTokenFilter with no dictionary.
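
The effect of the size limits, the optional dictionary, and onlyLongestMatch is easier to see on a concrete word. The following is a minimal plain-Java sketch, not Lucene's implementation. The class CompoundSketch, its decompose method, and the explicit array of break offsets (standing in for the break points a HyphenationTree would supply) are all hypothetical. Treating the size limits as inclusive bounds and lower-casing before the dictionary lookup are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is NOT the Lucene class.
class CompoundSketch {

    /**
     * Collects candidate subwords of `word` that start and end at the given
     * break offsets. `breaks` stands in for hyphenation points (assumption).
     * A null dictionary accepts every candidate that fits the size limits.
     */
    static List<String> decompose(String word, int[] breaks, Set<String> dictionary,
                                  int minWordSize, int minSubwordSize, int maxSubwordSize,
                                  boolean onlyLongestMatch) {
        List<String> out = new ArrayList<>();
        if (word.length() < minWordSize) {
            return out; // word too short: nothing is decomposed
        }
        for (int i = 0; i < breaks.length; i++) {
            String longest = null;
            for (int j = i + 1; j < breaks.length; j++) {
                int length = breaks[j] - breaks[i];
                if (length > maxSubwordSize) {
                    break;    // later break points only make the part longer
                }
                if (length < minSubwordSize) {
                    continue; // too short, try the next break point
                }
                String part = word.substring(breaks[i], breaks[j]);
                boolean accepted = dictionary == null
                        || dictionary.contains(part.toLowerCase(Locale.ROOT));
                if (!accepted) {
                    continue;
                }
                if (onlyLongestMatch) {
                    longest = part; // parts grow with j, so the last hit is the longest
                } else {
                    out.add(part);
                }
            }
            if (longest != null) {
                out.add(longest);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        int[] breaks = {0, 5, 10, 16}; // Donau | dampf | schiff
        System.out.println(decompose("Donaudampfschiff", breaks, dictionary, 5, 2, 15, false));
    }
}
```

With the dictionary above, the call in main yields Donau, dampf, schiff. If "dampfschiff" is added to the dictionary and onlyLongestMatch is true, the sketch keeps dampfschiff and drops the shorter dampf that starts at the same offset.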
public static HyphenationTree getHyphenationTree(String hyphenationFilename) throws IOException
Create a hyphenator tree
Parameters:
hyphenationFilename - the filename of the XML grammar to load
Throws:
IOException - If there is a low-level I/O error.

public static HyphenationTree getHyphenationTree(InputSource hyphenationSource) throws IOException
Create a hyphenator tree
Parameters:
hyphenationSource - the InputSource pointing to the XML grammar
Throws:
IOException - If there is a low-level I/O error.

protected void decompose()
Description copied from class: CompoundWordTokenFilterBase
Decomposes the current CompoundWordTokenFilterBase.termAtt and places CompoundWordTokenFilterBase.CompoundToken instances in the CompoundWordTokenFilterBase.tokens list.
The original token must not be placed in the list, because it is passed through this filter automatically.
Specified by:
decompose in class CompoundWordTokenFilterBase
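
The pass-through rule for decompose() can be pictured with a small plain-Java sketch. It is not Lucene code: PassThroughSketch, Tok, and emit are hypothetical names, and stacking the subwords at the original token's position (a position increment of 0) is an assumption that this page does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is NOT the Lucene class.
class PassThroughSketch {

    /** Hypothetical token: term text plus a position increment. */
    static final class Tok {
        final String text;
        final int positionIncrement;

        Tok(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    /**
     * Emits the original token first, then every subword found by the
     * decomposition step. The original is added here, by the caller of the
     * decomposition, so the subword list must not contain it again.
     */
    static List<Tok> emit(String original, List<String> subwords) {
        List<Tok> out = new ArrayList<>();
        out.add(new Tok(original, 1));   // passed through unchanged
        for (String subword : subwords) {
            out.add(new Tok(subword, 0)); // assumed: same position as the original
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emit("Donaudampfschiff", Arrays.asList("Donau", "dampf", "schiff")));
    }
}
```

If the decomposition step also added the original term to its list, the stream in this sketch would carry "Donaudampfschiff" twice, which is the duplication the rule avoids.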
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.