public final class QueryAutoStopWordAnalyzer extends AnalyzerWrapper
An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.
For very large indexes, the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38-million-document index in which one term appeared in around 50% of the documents, causing TermQueries for that term to take 2 seconds.
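The idea described above, treating any term that occurs in more than a given share of documents as a stop word, can be sketched in plain Java. This is only an illustration under stated assumptions, not Lucene's implementation: the class `DocFreqStopWordSketch` and its methods are hypothetical, each document is modelled as a set of already-tokenized terms, and truncating the percentage cutoff to an `int` is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java illustration only; none of these names come from Lucene. */
final class DocFreqStopWordSketch {

    /** Counts, for each term, how many documents contain it at least once. */
    static Map<String, Integer> documentFrequencies(List<Set<String>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        return docFreq;
    }

    /**
     * Returns the terms whose document frequency is strictly greater than
     * maxPercentDocs (a fraction between 0.0 and 1.0) of the corpus size.
     * Truncating the cutoff to an int is an assumption of this sketch.
     */
    static Set<String> stopWords(List<Set<String>> docs, float maxPercentDocs) {
        int maxDocFreq = (int) (docs.size() * maxPercentDocs);
        Set<String> stop = new HashSet<>();
        for (Map.Entry<String, Integer> e : documentFrequencies(docs).entrySet()) {
            if (e.getValue() > maxDocFreq) {
                stop.add(e.getKey());
            }
        }
        return stop;
    }

    /** Drops stop words from already-tokenized query terms, keeping order. */
    static List<String> filterQuery(List<String> queryTerms, Set<String> stop) {
        List<String> kept = new ArrayList<>();
        for (String term : queryTerms) {
            if (!stop.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

With four documents and a cutoff of 0.5, a term found in three documents is dropped from queries, while a term found in exactly two is kept, because the comparison is strict.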
Analyzer.ReuseStrategy, Analyzer.TokenStreamComponents
Modifier and Type | Field and Description |
---|---|
static float | defaultMaxDocFreqPercent |
GLOBAL_REUSE_STRATEGY, PER_FIELD_REUSE_STRATEGY
Constructor and Description |
---|
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader): Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent |
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, Collection<String> fields, float maxPercentDocs): Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs |
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, Collection<String> fields, int maxDocFreq): Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq |
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, float maxPercentDocs): Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs |
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, int maxDocFreq): Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq |
Modifier and Type | Method and Description |
---|---|
Term[] | getStopWords(): Provides information on which stop words have been identified for all fields |
String[] | getStopWords(String fieldName): Provides information on which stop words have been identified for a field |
protected Analyzer | getWrappedAnalyzer(String fieldName) |
protected Analyzer.TokenStreamComponents | wrapComponents(String fieldName, Analyzer.TokenStreamComponents components) |
attributeFactory, createComponents, getOffsetGap, getPositionIncrementGap, initReader, initReaderForNormalization, normalize, wrapReader, wrapReaderForNormalization, wrapTokenStreamForNormalization
close, getReuseStrategy, getVersion, normalize, setVersion, tokenStream, tokenStream
public static final float defaultMaxDocFreqPercent
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
IOException - Can be thrown while reading from the IndexReader
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, int maxDocFreq) throws IOException
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
maxDocFreq - Document frequency that a term must exceed in order to be treated as a stopword
IOException - Can be thrown while reading from the IndexReader
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, float maxPercentDocs) throws IOException
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word
IOException - Can be thrown while reading from the IndexReader
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, Collection<String> fields, float maxPercentDocs) throws IOException
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
fields - Selection of fields to calculate stopwords for
maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word
IOException - Can be thrown while reading from the IndexReader
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, Collection<String> fields, int maxDocFreq) throws IOException
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
fields - Selection of fields to calculate stopwords for
maxDocFreq - Document frequency that a term must exceed in order to be treated as a stopword
IOException - Can be thrown while reading from the IndexReader
protected Analyzer getWrappedAnalyzer(String fieldName)
Specified by: getWrappedAnalyzer in class AnalyzerWrapper
protected Analyzer.TokenStreamComponents wrapComponents(String fieldName, Analyzer.TokenStreamComponents components)
Overrides: wrapComponents in class AnalyzerWrapper
public String[] getStopWords(String fieldName)
fieldName - The field for which stop words identified in "addStopWords" method calls will be returned
public Term[] getStopWords()
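The relationship between the two getStopWords variants, one returning the words for a single field and one returning field/word pairs for every field, might be pictured with the following plain-Java sketch. Everything in it is hypothetical and for illustration only: `PerFieldStopWordsSketch`, its `addStopWords` method, and the `FieldTerm` class (a stand-in for a field name plus term text) are assumptions, not Lucene types or Lucene's internal data structure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical per-field stop word registry; not part of Lucene. */
final class PerFieldStopWordsSketch {

    /** Hypothetical stand-in for a (field, text) pair. */
    static final class FieldTerm {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public String toString() {
            return field + ":" + text;
        }
    }

    // Sorted containers keep the output order deterministic.
    private final Map<String, Set<String>> stopWordsPerField = new TreeMap<>();

    /** Registers stop words for one field. */
    void addStopWords(String fieldName, Set<String> words) {
        stopWordsPerField.computeIfAbsent(fieldName, f -> new TreeSet<>()).addAll(words);
    }

    /** Stop words for one field; an empty array if the field is unknown. */
    String[] getStopWords(String fieldName) {
        Set<String> words =
            stopWordsPerField.getOrDefault(fieldName, Collections.emptySet());
        return words.toArray(new String[0]);
    }

    /** Stop words for all fields, flattened into (field, text) pairs. */
    FieldTerm[] getStopWords() {
        List<FieldTerm> all = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : stopWordsPerField.entrySet()) {
            for (String text : e.getValue()) {
                all.add(new FieldTerm(e.getKey(), text));
            }
        }
        return all.toArray(new FieldTerm[0]);
    }
}
```

The per-field form is convenient when inspecting why a particular query field lost a term, while the flattened form gives a single overview across fields.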
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.