public class DFISimilarity extends SimilarityBase
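DFI (Divergence from Independence) is both parameter-free (it requires no parameter tuning or training) and non-parametric (it makes no assumptions about word frequency distributions in document collections).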
It is highly recommended not to remove stopwords (very common terms such as the, of, and, to, a, in, for, is, on, that) when using this similarity.
For more information see: A nonparametric term weighting method for information retrieval based on measuring the divergence from independence
See Also: IndependenceStandardized, IndependenceSaturated, IndependenceChiSquared
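The three measures listed above each compare an observed term frequency with the frequency expected under independence. The following is a minimal, self-contained sketch of how they are commonly defined. The class `IndependenceSketch` and its method names are hypothetical and are not part of Lucene, and the formulas are an assumption that should be checked against the source of each `Independence` subclass.

```java
// Hypothetical helper for illustration only; not a Lucene class.
final class IndependenceSketch {
    private IndependenceSketch() {}

    /** Assumed form of the standardized measure: (freq - expected) / sqrt(expected). */
    static double standardized(double freq, double expected) {
        return (freq - expected) / Math.sqrt(expected);
    }

    /** Assumed form of the saturated measure: (freq - expected) / expected. */
    static double saturated(double freq, double expected) {
        return (freq - expected) / expected;
    }

    /** Assumed form of the chi-squared measure: (freq - expected)^2 / expected. */
    static double chiSquared(double freq, double expected) {
        double diff = freq - expected;
        return diff * diff / expected;
    }

    public static void main(String[] args) {
        double freq = 9.0;      // observed term frequency in the document
        double expected = 4.0;  // frequency expected under independence
        System.out.println(standardized(freq, expected)); // prints 2.5
        System.out.println(saturated(freq, expected));    // prints 1.25
        System.out.println(chiSquared(freq, expected));   // prints 6.25
    }
}
```

For the same inputs the chi-squared form grows fastest and the saturated form slowest, which is one way to think about choosing a measure.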
Nested classes inherited from class Similarity: Similarity.SimScorer
Fields inherited from class SimilarityBase: discountOverlaps
Constructor and Description |
---|
DFISimilarity(Independence independenceMeasure): Create DFI with the specified divergence from independence measure. |
Modifier and Type | Method and Description |
---|---|
protected Explanation | explain(BasicStats stats, Explanation freq, double docLen): Explains the score. |
Independence | getIndependence(): Returns the measure of independence. |
protected double | score(BasicStats stats, double freq, double docLen): Scores the document doc. |
String | toString(): Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well. |
Methods inherited from class SimilarityBase: computeNorm, explain, fillBasicStats, getDiscountOverlaps, log2, newStats, scorer, setDiscountOverlaps
public DFISimilarity(Independence independenceMeasure)
Create DFI with the specified divergence from independence measure.
Parameters:
independenceMeasure - measure of divergence from independence

protected double score(BasicStats stats, double freq, double docLen)
Description copied from class: SimilarityBase
Scores the document doc. Subclasses must apply their scoring formula in this method.
Specified by: score in class SimilarityBase
Parameters:
stats - the corpus level statistics.
freq - the term frequency.
docLen - the document length.

public Independence getIndependence()
Returns the measure of independence.
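As an illustration of how `score(BasicStats, double, double)` might combine the corpus statistics with the configured independence measure, here is a minimal, self-contained sketch. It assumes the expected frequency is estimated as `(totalTermFreq + 1) * docLen / (numberOfFieldTokens + 1)`, that terms occurring no more often than expected score zero, and that the result is `boost * log2(measure + 1)`. The class `DfiScoreSketch` and its parameter names are hypothetical stand-ins for the values carried by `BasicStats`. This is not the Lucene implementation, so verify the details against the source before relying on them.

```java
import java.util.function.DoubleBinaryOperator;

// Hypothetical stand-in for the scoring step; not the Lucene class itself.
final class DfiScoreSketch {
    private DfiScoreSketch() {}

    /** Base-2 logarithm. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * Assumed scoring rule.
     * boost, totalTermFreq and numberOfFieldTokens stand in for values read from BasicStats;
     * measure maps (freq, expected) to a divergence-from-independence value.
     */
    static double score(double boost, long totalTermFreq, long numberOfFieldTokens,
                        double freq, double docLen, DoubleBinaryOperator measure) {
        // Term frequency expected in a document of this length if the term
        // were distributed independently of documents.
        double expected = (totalTermFreq + 1) * docLen / (numberOfFieldTokens + 1);
        // A term that occurs no more often than expected contributes nothing.
        if (freq <= expected) {
            return 0.0;
        }
        return boost * log2(measure.applyAsDouble(freq, expected) + 1);
    }

    public static void main(String[] args) {
        // Standardized measure, assumed to be (freq - expected) / sqrt(expected).
        DoubleBinaryOperator standardized = (f, e) -> (f - e) / Math.sqrt(e);
        // expected = 100 * 100 / 10000 = 1.0, measure = 3.0, score = log2(4.0)
        System.out.println(score(1.0, 99L, 9999L, 4.0, 100.0, standardized));
    }
}
```

Under these assumptions the zero cutoff means frequent function words still contribute to the expected-frequency estimate without dominating scores, which is consistent with the recommendation to keep stopwords in the index.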

protected Explanation explain(BasicStats stats, Explanation freq, double docLen)
Description copied from class: SimilarityBase
Explains the score. The explanation attaches the score (computed via the SimilarityBase.score(BasicStats, double, double) method) and the explanation for the term frequency. Subclasses content with this format may add additional details in SimilarityBase.explain(List, BasicStats, double, double).
Overrides: explain in class SimilarityBase
Parameters:
stats - the corpus level statistics.
freq - the term frequency and its explanation.
docLen - the document length.

public String toString()
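Description copied from class: SimilarityBase
Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.
Specified by: toString in class SimilarityBase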
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.