Probabilistic information retrieval estimates the probability P(t|R = 1) that a term t appears in a relevant document, and uses this as the basis of a classifier that decides whether documents are relevant or not.
1. Basic probability theory: prior probability, posterior probability, odds
2. The Probability Ranking Principle: the 1/0 loss case
3. Binary independence model
(1) Let pt = P(xt = 1|R = 1, ~q) be the probability of a term appearing in a document relevant to the query, and ut = P(xt = 1|R = 0, ~q) the probability of a term appearing in a nonrelevant document.
(2) Determine a guess for the size of the relevant document set.
(3) Improve the guesses for pt and ut.
(4) Go to step 2 until the ranking of the returned results converges.
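The iterative loop above can be sketched in Python. This is a minimal illustration, not the book's exact procedure: the function names, the pseudo-relevance heuristic of taking the top-k documents as the guessed relevant set, and the +0.5 smoothing in the estimates are all choices made here for the sketch.

```python
import math

def rsv(doc_terms, query_terms, p, u):
    """Retrieval Status Value: sum of log odds-ratio weights
    over query terms that occur in the document."""
    return sum(math.log((p[t] * (1 - u[t])) / (u[t] * (1 - p[t])))
               for t in query_terms if t in doc_terms)

def bim_rank(docs, query, k=2, iters=10):
    """Rank documents (as sets of terms), re-estimating p_t from the
    top-k documents each round until the ranking stops changing."""
    N = len(docs)
    df = {t: sum(t in d for d in docs) for t in query}
    p = {t: 0.5 for t in query}                       # initial guess for p_t
    u = {t: (df[t] + 0.5) / (N + 1) for t in query}   # smoothed df/N for u_t
    ranking = None
    for _ in range(iters):
        scored = sorted(range(N), key=lambda i: rsv(docs[i], query, p, u),
                        reverse=True)
        if scored == ranking:          # step 4: stop when ranking converges
            break
        ranking = scored
        V = [docs[i] for i in scored[:k]]             # guessed relevant set
        p = {t: (sum(t in d for d in V) + 0.5) / (len(V) + 1)
             for t in query}                          # step 3: improve p_t
    return ranking
```

Documents containing more (and rarer) query terms float to the top, and the re-estimation sharpens the term weights on each pass.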
Language models
A document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often.
The basic language modeling approach builds a probabilistic language model Md from each document d, and ranks documents based on the probability of the model generating the query: P(q|Md).
1. Language models:
Generative model
Language of an automaton: the full set of strings it can generate, from formal language theory
Language model: a function that puts a probability measure over strings drawn from some vocabulary
Unigram language model: Puni(t1t2t3t4) = P(t1)P(t2)P(t3)P(t4)
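The unigram model above can be made concrete in a few lines of Python. This is an illustrative sketch with made-up function names: the model is estimated by maximum likelihood (term count divided by total tokens), and a sequence's probability is the product of its terms' probabilities.

```python
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram model: P(t) = count(t) / total tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

def p_uni(sequence, model):
    """P_uni(t1 t2 ... tn) = P(t1) * P(t2) * ... * P(tn)."""
    p = 1.0
    for t in sequence:
        p *= model.get(t, 0.0)  # unseen terms get probability 0 under MLE
    return p
```

Note that any sequence containing an unseen term gets probability zero, which is exactly the problem smoothing (item (4) below) addresses.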
2. The query likelihood model
(1) Goal: rank documents by P(d|q)
(2) Method: by Bayes' Rule, P(d|q) = P(q|d)P(d)/P(q)
→ rank by P(q|d), the probability of the query q under the language model derived from d; the query is treated as a random sample from the respective document model (P(d) is assumed uniform and P(q) is the same for all documents, so both are usually ignored)
→ a multinomial Naive Bayes model (page 263)
(3) Estimate P(q|Md): count up how often each word occurs and divide by the total number of words in the document d.
(4)
Optimization: smooth probabilities in our document language models by to
discount non-zero probabilities and to give some probability mass to unseen
words.
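Steps (3) and (4) together can be sketched as a query-likelihood scorer. As an assumption for this sketch, smoothing is done by linear interpolation with the collection model (Jelinek-Mercer style), P(t|d) = λ·tf(t,d)/|d| + (1−λ)·cf(t)/T, with λ = 0.5 chosen arbitrarily; the function name and signature are invented here.

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Score P(q|Md) with linearly interpolated smoothing:
    P(t|d) = lam * tf(t,d)/|d| + (1 - lam) * cf(t)/T,
    where doc and collection are token lists and T is the collection size."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    T = len(collection)
    score = 1.0
    for t in query:
        p_doc = doc_counts[t] / len(doc)    # MLE from the document (step 3)
        p_coll = coll_counts[t] / T         # collection mass for unseen words
        score *= lam * p_doc + (1 - lam) * p_coll
    return score
```

Because the collection model assigns non-zero mass to every collection term, a query word missing from the document no longer zeroes out the whole score.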