Monday, February 3, 2014

Week 5 Reading Notes: Matching models - probabilistic and language models

Probabilistic information retrieval
The idea is to estimate the probability of a term t appearing in a relevant document, P(t|R = 1); this estimate can be the basis of a classifier that decides whether documents are relevant or not.
1.    basic probability theory: prior probability, posterior probability, odds
2.    The Probability Ranking Principle: The 1/0 loss case
3.    Binary independence model
(1)  pt = P(xt = 1|R = 1, q): the probability of the term appearing in a document relevant to the query; ut = P(xt = 1|R = 0, q): the probability of the term appearing in a nonrelevant document
(2)  Determine a guess for the size of the relevant document set.
(3)  Improve our guesses for pt and ut.
(4)  Repeat from step (2) until the ranking of the returned results converges.
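The quantities pt and ut combine into a per-term weight for ranking. Here is a minimal sketch of the standard Binary Independence Model term weight with add-0.5 smoothing; all counts below (N, df_t, S, s_t) are invented for illustration:

```python
import math

def bim_weight(N, df_t, S, s_t):
    """BIM term weight c_t = log[ p_t (1 - u_t) / ((1 - p_t) u_t) ].

    N: total documents; df_t: documents containing term t;
    S: known relevant documents; s_t: relevant documents containing t.
    Uses add-0.5 smoothing on the relevance-feedback counts.
    """
    p_t = (s_t + 0.5) / (S + 1)             # estimate of P(x_t = 1 | R = 1)
    u_t = (df_t - s_t + 0.5) / (N - S + 1)  # estimate of P(x_t = 1 | R = 0)
    return math.log(p_t * (1 - u_t) / ((1 - p_t) * u_t))

# A document's retrieval status value is the sum of the weights
# of the query terms it contains.
print(bim_weight(N=1000, df_t=50, S=10, s_t=8))
```

A positive weight means the term is more likely in relevant documents than nonrelevant ones, so its presence raises a document's score.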

Language models
A document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often.

The basic language modeling approach builds a probabilistic language model Md from each document d, and ranks documents based on the probability of the model generating the query: P(q|Md).

1.    Language models:
Generative model
Language of an automaton: in formal language theory, the full set of strings that the automaton can generate
Language model: a function that puts a probability measure over strings drawn from some vocabulary
unigram language model:  Puni(t1t2t3t4) = P(t1)P(t2)P(t3)P(t4)
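The unigram formula can be sketched with a toy "document" (the text below is invented; term probabilities are maximum-likelihood estimates from the counts):

```python
from collections import Counter

# Toy document for illustration.
doc = "the quick brown fox jumps over the lazy dog the fox".split()
counts = Counter(doc)
total = len(doc)

def p_uni(term):
    # Maximum-likelihood unigram probability: count / document length.
    return counts[term] / total

# Under the unigram model, a string's probability is the product
# of its terms' probabilities, ignoring word order.
p = p_uni("the") * p_uni("fox")
print(p)
```

Because the model multiplies independent term probabilities, "the fox" and "fox the" get the same probability; that order-independence is exactly what makes the unigram model cheap but crude.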

2.    The query likelihood model
(1)   Goal: rank documents by P(d|q)
(2)   Method: using Bayes Rule P(d|q) = P(q|d)P(d)/P(q)
→ P(q|d), the probability of the query q under the language model derived from d;
a query is treated as a random sample from the respective document model
(P(q) is the same for all documents, and P(d) is often treated as uniform, so both are usually ignored)
→ multinomial Naive Bayes model (page 263)
(3) estimate P(q|Md): count up how often each word occurred, and divide through by the total number of words in the document d.
(4) Optimization: smooth the probabilities in our document language models, discounting non-zero probabilities and giving some probability mass to unseen words.
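Steps (3) and (4) can be sketched together. This minimal example uses Jelinek-Mercer (linear interpolation) smoothing, mixing the document model with a collection model; the documents and the mixing weight λ = 0.5 are made up for illustration:

```python
from collections import Counter

# Toy two-document collection (invented for illustration).
docs = {
    "d1": "click go the shears boys click click click".split(),
    "d2": "metal here click here".split(),
}
collection = [t for d in docs.values() for t in d]
coll_counts = Counter(collection)

def p_query(query, doc_id, lam=0.5):
    """Query likelihood P(q|d) with Jelinek-Mercer smoothing:
    P(t|d) = lam * P_mle(t|Md) + (1 - lam) * P_mle(t|Mc)."""
    doc = docs[doc_id]
    doc_counts = Counter(doc)
    p = 1.0
    for t in query.split():
        p_doc = doc_counts[t] / len(doc)           # step (3): MLE in document
        p_coll = coll_counts[t] / len(collection)  # collection-wide MLE
        p *= lam * p_doc + (1 - lam) * p_coll      # step (4): interpolate
    return p

# Rank documents by P(q|d): the document whose model is more likely
# to generate the query scores higher.
print(p_query("click shears", "d1"))
print(p_query("click shears", "d2"))
```

Note that without the collection term, any document missing a single query word would get probability zero; the interpolation is what keeps d2 (which never contains "shears") from being eliminated outright.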


