public class IndriDirichletSimilarity extends LMSimilarity
P(t|E) = (tf_E + mu*P(t|D)) / (documentLength + documentMu)
where P(t|D) = (mu*P(t|C) + tf_D) / (doclen + mu)
A larger value for mu produces more smoothing. Smoothing matters most for short documents, where the probabilities are more granular.
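As a rough illustration of the P(t|D) estimate, the following self-contained Java sketch evaluates it for a short and a long document under two values of mu. The class name `DirichletSmoothingSketch`, the method `smoothedProbability`, and all sample numbers are assumptions made for this example; it does not call or reproduce the Lucene implementation.

```java
// Hypothetical helper, not part of Lucene: evaluates the P(t|D) formula only.
final class DirichletSmoothingSketch {

    // P(t|D) = (tf_D + mu * P(t|C)) / (doclen + mu)
    static double smoothedProbability(double tf, double docLen,
                                      double collectionProb, double mu) {
        return (tf + mu * collectionProb) / (docLen + mu);
    }

    public static void main(String[] args) {
        double collectionProb = 0.001;   // assumed P(t|C)
        double[] mus = {100.0, 2000.0};  // a small mu and the default of 2000

        for (double mu : mus) {
            // Short document: 2 occurrences in 10 tokens.
            double shortDoc = smoothedProbability(2, 10, collectionProb, mu);
            // Long document with the same relative frequency: 200 in 1000 tokens.
            double longDoc = smoothedProbability(200, 1000, collectionProb, mu);
            System.out.println("mu=" + mu
                    + " short=" + shortDoc
                    + " long=" + longDoc);
        }
    }
}
```

In this sketch a larger mu pulls both estimates toward P(t|C), and the short document moves further than the long one because its own counts carry less weight against mu. A term that does not occur in the document (tf = 0) still receives a non-zero probability as long as P(t|C) is positive.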
Modifier and Type | Class and Description |
---|---|
static class | IndriDirichletSimilarity.IndriCollectionModel Models p(w|C) as the number of occurrences of the term in the collection, divided by the total number of tokens + 1. |
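The collection model description can be read as a simple ratio. The sketch below follows that wording literally; the class name `CollectionModelSketch`, the method `collectionProbability`, and its two parameters are hypothetical names chosen for this example, and the shipped `IndriCollectionModel` may compute the value differently.

```java
// Hypothetical helper, not part of Lucene: follows the textual description only.
final class CollectionModelSketch {

    // p(w|C) = occurrences of the term in the collection / (total tokens + 1)
    static double collectionProbability(long totalTermFreq, long totalTokens) {
        return (double) totalTermFreq / (totalTokens + 1.0);
    }

    public static void main(String[] args) {
        // Assumed sample counts: the term occurs 5 times among 999 tokens.
        System.out.println(collectionProbability(5, 999));
    }
}
```

Under this reading, the `+ 1` keeps the denominator non-zero even when the field contains no tokens.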
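Nested classes/interfaces inherited from class LMSimilarity: LMSimilarity.CollectionModel, LMSimilarity.DefaultCollectionModel, LMSimilarity.LMStats
Nested classes/interfaces inherited from class Similarity: Similarity.SimScorer
Fields inherited from class LMSimilarity: collectionModel
Fields inherited from class SimilarityBase: discountOverlaps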
Constructor and Description |
---|
IndriDirichletSimilarity() Instantiates the similarity with the default μ value of 2000. |
IndriDirichletSimilarity(float mu) Instantiates the similarity with the provided μ parameter. |
IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel) Instantiates the similarity with the default μ value of 2000. |
IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel, float mu) Instantiates the similarity with the provided μ parameter. |
Modifier and Type | Method and Description |
---|---|
protected void | explain(List<Explanation> subs, BasicStats stats, double freq, double docLen) Subclasses should implement this method to explain the score. |
float | getMu() Returns the μ parameter. |
String | getName() Returns the name of the LM method. |
protected double | score(BasicStats stats, double freq, double docLen) Scores the document doc. |
Methods inherited from class LMSimilarity: fillBasicStats, newStats, toString
Methods inherited from class SimilarityBase: computeNorm, explain, getDiscountOverlaps, log2, scorer, setDiscountOverlaps
public IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel, float mu)
public IndriDirichletSimilarity(float mu)
public IndriDirichletSimilarity(LMSimilarity.CollectionModel collectionModel)
public IndriDirichletSimilarity()
protected double score(BasicStats stats, double freq, double docLen)
Description copied from class: SimilarityBase
Scores the document doc.
Subclasses must apply their scoring formula in this method.
Specified by: score in class SimilarityBase
Parameters:
stats - the corpus level statistics.
freq - the term frequency.
docLen - the document length.
protected void explain(List<Explanation> subs, BasicStats stats, double freq, double docLen)
Description copied from class: SimilarityBase
Subclasses should implement this method to explain the score. expl already contains the score, the name of the class and the doc id, as well as the term frequency and its explanation; subclasses can add additional clauses to explain details of their scoring formulae.
The default implementation does nothing.
Overrides: explain in class LMSimilarity
Parameters:
subs - the list of details of the explanation to extend
stats - the corpus level statistics.
freq - the term frequency.
docLen - the document length.
public float getMu()
Returns the μ parameter.
public String getName()
Description copied from class: LMSimilarity
Returns the name of the LM method. Used in LMSimilarity.toString().
Specified by: getName in class LMSimilarity
Copyright © 2000-2021 Apache Software Foundation. All Rights Reserved.