Monday, February 3, 2014

Week 5 Reading Notes: Matching models - probabilistic and language models

Probabilistic information retrieval
The idea is to estimate the probability of a term t appearing in a relevant document, P(t|R = 1); this estimate can be the basis of a classifier that decides whether documents are relevant or not.
1.    basic probability theory: prior probability, posterior probability, odds
2.    The Probability Ranking Principle: The 1/0 loss case
3.    Binary independence model
(1)  pt = P(xt = 1|R = 1, q): the probability of the term appearing in a document relevant to the query; ut = P(xt = 1|R = 0, q): the probability of the term appearing in a nonrelevant document
(2)  Determine a guess for the size of the relevant document set.
(3)  Improve our guesses for pt and ut.
(4)  Repeat from step (2) until the ranking of the returned results converges.
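The quantities pt and ut combine into a per-term weight for ranking. Here is a minimal sketch of the standard Binary Independence Model term weight with add-0.5 smoothing; all counts below (N, df_t, S, s_t) are invented for illustration:

```python
import math

def bim_weight(N, df_t, S, s_t):
    """BIM term weight c_t = log[ p_t (1 - u_t) / ((1 - p_t) u_t) ].

    N: total documents; df_t: documents containing term t;
    S: known relevant documents; s_t: relevant documents containing t.
    Uses add-0.5 smoothing on the relevance-feedback counts.
    """
    p_t = (s_t + 0.5) / (S + 1)             # estimate of P(x_t = 1 | R = 1)
    u_t = (df_t - s_t + 0.5) / (N - S + 1)  # estimate of P(x_t = 1 | R = 0)
    return math.log(p_t * (1 - u_t) / ((1 - p_t) * u_t))

# A document's retrieval status value is the sum of the weights
# of the query terms it contains.
print(bim_weight(N=1000, df_t=50, S=10, s_t=8))
```

A positive weight means the term is more likely in relevant documents than nonrelevant ones, so its presence raises a document's score.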

Language models
A document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often.

The basic language modeling approach builds a probabilistic language model Md from each document d, and ranks documents based on the probability of the model generating the query: P(q|Md).

1.    Language models:
Generative model
Language of an automaton: in formal language theory, the full set of strings that the automaton can generate
Language model: a function that puts a probability measure over strings drawn from some vocabulary
unigram language model:  Puni(t1t2t3t4) = P(t1)P(t2)P(t3)P(t4)
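The unigram formula can be sketched with a toy "document" (the text below is invented; term probabilities are maximum-likelihood estimates from the counts):

```python
from collections import Counter

# Toy document for illustration.
doc = "the quick brown fox jumps over the lazy dog the fox".split()
counts = Counter(doc)
total = len(doc)

def p_uni(term):
    # Maximum-likelihood unigram probability: count / document length.
    return counts[term] / total

# Under the unigram model, a string's probability is the product
# of its terms' probabilities, ignoring word order.
p = p_uni("the") * p_uni("fox")
print(p)
```

Because the model multiplies independent term probabilities, "the fox" and "fox the" get the same probability; that order-independence is exactly what makes the unigram model cheap but crude.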

2.    The query likelihood model
(1)   Goal: rank documents by P(d|q)
(2)   Method: using Bayes Rule P(d|q) = P(q|d)P(d)/P(q)
→ P(q|d), the probability of the query q under the language model derived from d;
a query is treated as a random sample from the respective document model
(P(q) is the same for all documents, and P(d) is often treated as uniform, so both are usually ignored)
→ multinomial Naive Bayes model (page 263)
(3) estimate P(q|Md): count up how often each word occurred, and divide through by the total number of words in the document d.
(4) Optimization: smooth the probabilities in our document language models, discounting non-zero probabilities and giving some probability mass to unseen words.
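Steps (3) and (4) can be sketched together. This minimal example uses Jelinek-Mercer (linear interpolation) smoothing, mixing the document model with a collection model; the documents and the mixing weight λ = 0.5 are made up for illustration:

```python
from collections import Counter

# Toy two-document collection (invented for illustration).
docs = {
    "d1": "click go the shears boys click click click".split(),
    "d2": "metal here click here".split(),
}
collection = [t for d in docs.values() for t in d]
coll_counts = Counter(collection)

def p_query(query, doc_id, lam=0.5):
    """Query likelihood P(q|d) with Jelinek-Mercer smoothing:
    P(t|d) = lam * P_mle(t|Md) + (1 - lam) * P_mle(t|Mc)."""
    doc = docs[doc_id]
    doc_counts = Counter(doc)
    p = 1.0
    for t in query.split():
        p_doc = doc_counts[t] / len(doc)           # step (3): MLE in document
        p_coll = coll_counts[t] / len(collection)  # collection-wide MLE
        p *= lam * p_doc + (1 - lam) * p_coll      # step (4): interpolate
    return p

# Rank documents by P(q|d): the document whose model is more likely
# to generate the query scores higher.
print(p_query("click shears", "d1"))
print(p_query("click shears", "d2"))
```

Note that without the collection term, any document missing a single query word would get probability zero; the interpolation is what keeps d2 (which never contains "shears") from being eliminated outright.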


