Information retrieval has developed as a
highly empirical discipline, requiring careful and thorough evaluation to
demonstrate the superior performance of novel techniques on representative document
collections.
1. Test collection
A test collection consists of:
(1) A document collection;
(2) A test suite of information needs, expressible as queries;
(3) A set of relevance judgments, standardly a binary assessment of either relevant or nonrelevant for each query-document pair.
Standard test collections: Cranfield, Text Retrieval Conference (TREC).
2. Evaluation of unranked retrieval sets
(1) Precision (P) is the fraction of retrieved documents that are relevant.
(2) Recall (R) is the fraction of relevant documents that are retrieved.
(3) Accuracy is the fraction of the system's relevant/nonrelevant classifications that are correct.
(4) The F measure is the weighted harmonic mean of precision and recall; with equal weighting it reduces to F1 = 2PR / (P + R). (See the sketch after this list.)
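A minimal sketch of these four measures, assuming the retrieved set and the relevant set are represented as Python sets of document IDs (the function names and toy data are illustrative, not from the chapter):

    def precision(retrieved, relevant):
        """Fraction of retrieved documents that are relevant."""
        return len(retrieved & relevant) / len(retrieved)

    def recall(retrieved, relevant):
        """Fraction of relevant documents that are retrieved."""
        return len(retrieved & relevant) / len(relevant)

    def accuracy(retrieved, relevant, collection_size):
        """Fraction of relevant/nonrelevant classifications that are correct."""
        tp = len(retrieved & relevant)       # relevant and retrieved
        fp = len(retrieved - relevant)       # nonrelevant but retrieved
        fn = len(relevant - retrieved)       # relevant but missed
        tn = collection_size - tp - fp - fn  # nonrelevant, correctly left out
        return (tp + tn) / collection_size

    def f_measure(p, r, beta=1.0):
        """Weighted harmonic mean of P and R; beta=1 gives the balanced F1."""
        b2 = beta ** 2
        return (b2 + 1) * p * r / (b2 * p + r)

    retrieved = {1, 2, 3, 4}
    relevant = {2, 3, 5}
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    print(p, r, accuracy(retrieved, relevant, collection_size=10), f_measure(p, r))
    # -> 0.5  0.666...  0.7  0.571...

Note that accuracy needs the total collection size, since it also counts the nonrelevant documents correctly left unretrieved; this is why it is a poor measure for IR, where nearly all documents are nonrelevant.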
3. Evaluation of ranked retrieval results
Key tools: the precision-recall curve and interpolated precision, where the interpolated precision at a recall level r is the highest precision found at any recall level r' >= r (illustrated in the sketch below).
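A minimal sketch, assuming the ranking is a list of document IDs and the judgments a set (names and toy data are illustrative); it evaluates interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0:

    def precision_recall_points(ranking, relevant):
        """(recall, precision) after each position of the ranking."""
        hits, points = 0, []
        for k, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
            points.append((hits / len(relevant), hits / k))
        return points

    def interpolated_precision(points, r):
        """Highest precision at any recall level >= r (0.0 if none)."""
        return max((p for rec, p in points if rec >= r), default=0.0)

    ranking = [7, 2, 5, 9, 3]   # system output, best first
    relevant = {2, 3, 8}        # judged relevant documents
    points = precision_recall_points(ranking, relevant)
    for i in range(11):         # the 11 standard recall levels
        r = i / 10
        print(f"recall {r:.1f}: interpolated precision {interpolated_precision(points, r):.2f}")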
4. Assessing relevance
Pooling: relevance is assessed over a subset of the collection, formed from the top k documents returned by a number of different IR systems (see the sketch below).
Marginal relevance: whether a document still has distinctive usefulness after the user has looked at certain other documents.
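A minimal sketch of pool construction, assuming each system's run is a ranked list of document IDs (the runs and the cutoff k are illustrative). Only pooled documents are judged by assessors; documents outside the pool are treated as nonrelevant.

    def build_pool(runs, k):
        """Union of the top-k documents across all system runs."""
        pool = set()
        for run in runs:
            pool.update(run[:k])
        return pool

    runs = [
        [3, 1, 4, 10, 5],   # system A's ranking for one query
        [2, 7, 1, 8, 6],    # system B
        [9, 3, 2, 6, 5],    # system C
    ]
    print(sorted(build_pool(runs, k=3)))   # -> [1, 2, 3, 4, 7, 9]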
5. System quality and user utility
(1) User utility: a way of quantifying aggregate user happiness, based on the relevance, speed, and user interface of a system.
(2) Refining a deployed system: A/B testing, in which a small fraction of live traffic is diverted to a new variant and a metric such as clickthrough rate is compared against the existing system (see the sketch below).
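A minimal sketch of A/B bucketing, assuming users carry a string ID; the hash-based assignment and the 10% split are illustrative choices, not from the chapter.

    import hashlib

    def bucket(user_id, treatment_fraction=0.10):
        """Deterministically assign a user to 'control' or 'treatment'."""
        h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
        return "treatment" if (h % 10_000) / 10_000 < treatment_fraction else "control"

    # Serve each bucket with a different ranking system and compare a simple
    # user-utility proxy such as clickthrough rate on the top result.
    for uid in ["alice", "bob", "carol", "dave"]:
        print(uid, "->", bucket(uid))

Hashing the user ID rather than sampling randomly per request keeps each user's experience consistent across sessions.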