Traditional adaptive filtering systems learn the user's interests in a rather simple way: words from relevant documents are favored in the query model, while words from irrelevant documents are down-weighted. This biases the query model towards specific words seen in the past, causing the system to favor documents containing relevant but redundant information over documents that use previously unseen words to denote new facts about the same news event. This paper proposes new ways of generalizing from relevance feedback by augmenting the traditional bag-of-words query model with named entity wildcards that are anchored in context. The use of wildcards allows generalization beyond specific words, while contextual restrictions limit the wildcard-matching to entities related to the user's query. We test our new approach in a nugget-level adaptive filtering system and evaluate it in terms of both relevance and novelty of the presented information. Our results indicate that higher recall is obtained when lexical terms are generalized using wildcards. However, such wildcards must be anchored to their context to maintain good precision. How the context of a wildcard is represented and matched against a given document also plays a crucial role in the performance of the retrieval system.
Blog for Class IS2140
Saturday, April 19, 2014
Friday, April 11, 2014
Week 13 Reading Notes: Text classification and clustering
1. Text classification and Naïve Bayes:
Many users have ongoing information needs, such as tracking developments on a particular topic. One way of meeting such a need is to issue the same query, for example multicore, against an index of recent newswire articles each morning. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time.
2. Vector space classification
Vector space classification adopts a different representation for text classification, the vector space model. It represents each document as a vector with one real-valued component, usually a tf-idf weight, for each term.
3. Flat clustering
(1) Clustering algorithms group a set of documents into subsets, or clusters. The algorithms’ goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible, and documents in one cluster should be as dissimilar as possible from documents in other clusters.
(2) Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
4. Hierarchical clustering: single-link, complete-link, group-average, and centroid similarity
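The vector space representation from item 2 and the similarity notion behind clustering in item 3 can be illustrated with a small sketch. This is only one possible reading of the notes: the weighting formula (raw term frequency times log10(N/df)), the whitespace tokenizer, the helper names `tfidf_vectors` and `cosine`, and the three toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} vector per document.

    Assumed weighting: tf * log10(N / df), one common tf-idf variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection: two documents on one topic, one on another.
docs = [
    "multicore chips speed up computation",
    "new multicore chips announced",
    "clustering groups similar documents",
]
vecs = tfidf_vectors(docs)
sim_same_topic = cosine(vecs[0], vecs[1])
sim_other_topic = cosine(vecs[0], vecs[2])
```

In this toy collection the first two documents share weighted terms, so their similarity is higher than that of the first and third, which share none. A clustering algorithm would aim to place the first two in the same cluster.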
Friday, April 4, 2014
Week 12 Reading Notes: Intelligent information retrieval
1. User Profiles for Personalized Information Access
(1) User Profiling: personalization systems gather and exploit information about individuals in order to be effective.
(2) Explicit User Information Collection: user feedback, customization, navigation
(3) Implicit User Information Collection: browser caches, proxy servers, browser agents, desktop agents, and search logs.
2. Content-Based Recommendation Systems
These systems share a means for describing the items that may be recommended, a means for creating a profile of the user that describes the types of items the user likes, and a means of comparing items to the user profile to determine what to recommend.
(1) Item Representation: Items that can be recommended to the user are often stored in a database table.
(2) User Profiles: A profile of the user’s interests
(3) Learning a User Model: Creating a model of the user’s preference from the user history
(4) Decision Trees and Rule Induction
(5) Nearest Neighbor Methods
(6) Relevance Feedback and Rocchio’s Algorithm
(7) Linear Classifiers
(8) Probabilistic Methods and Naïve Bayes
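As a rough illustration of item (6), Rocchio's algorithm moves a query (or profile) vector toward the centroid of items the user liked and away from the centroid of items the user disliked. The sketch below uses the usual textbook form of the update; the function name `rocchio`, the parameter values `alpha=1.0`, `beta=0.75`, `gamma=0.15`, the clipping of non-positive weights, and the toy vectors are assumptions, not something these notes specify.

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated {term: weight} query vector.

    new_q = alpha * q + beta * centroid(relevant) - gamma * centroid(nonrelevant)
    The default alpha/beta/gamma values are assumed, commonly quoted settings.
    """
    updated = defaultdict(float)
    for term, w in query.items():
        updated[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            updated[term] += beta * w / len(relevant)
    for doc in nonrelevant:
        for term, w in doc.items():
            updated[term] -= gamma * w / len(nonrelevant)
    # Assumed convention: drop terms whose weight is not positive.
    return {t: w for t, w in updated.items() if w > 0}

# Hypothetical feedback: one liked item and one disliked item.
profile = {"jaguar": 1.0}
liked = [{"jaguar": 1.0, "cat": 1.0}]
disliked = [{"jaguar": 1.0, "car": 1.0}]
new_profile = rocchio(profile, liked, disliked)
```

After the update, "cat" enters the profile with a positive weight, while "car" would receive a negative weight and is dropped. The same update can serve as a simple way of learning a user model from feedback.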
Friday, March 28, 2014
Week 11 Reading Notes: Multilingual and parallel retrieval
Parallel Information Retrieval
Topic: Ways of making information retrieval systems scale to very large text collections, to overcome limitations of computation power and storage capacity.
1. Parallel query processing
a) A search engine’s service rate is increased by having multiple index servers process incoming queries in parallel (Document Partitioning, Term Partitioning, Hybrid Schemes)
b) Redundancy and fault tolerance issues in distributed search engines
2. MapReduce: parallel execution of off-line tasks
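Document partitioning, mentioned under parallel query processing above, can be sketched as follows: each index server holds a subset of the documents, a broker sends every query to all servers in parallel, and the per-server top-k lists are merged into a global top-k. Everything here is a simplified assumption for illustration: the `IndexServer` class, the `partition` and `broker_search` helpers, the term-count scoring, the round-robin assignment, and the use of threads in place of separate machines.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

class IndexServer:
    """Hypothetical index server holding one shard of the collection."""

    def __init__(self, docs):
        self.docs = docs  # {doc_id: text}

    def search(self, query, k):
        """Return up to k (score, doc_id) pairs from this shard only."""
        terms = query.lower().split()
        scored = []
        for doc_id, text in self.docs.items():
            tokens = text.lower().split()
            # Toy score: total occurrences of the query terms.
            score = sum(tokens.count(t) for t in terms)
            if score > 0:
                scored.append((score, doc_id))
        return heapq.nlargest(k, scored)

def partition(docs, n_servers):
    """Assign documents to servers round-robin (document partitioning)."""
    shards = [{} for _ in range(n_servers)]
    for i, (doc_id, text) in enumerate(sorted(docs.items())):
        shards[i % n_servers][doc_id] = text
    return [IndexServer(shard) for shard in shards]

def broker_search(servers, query, k):
    """Query all servers in parallel and merge their local top-k lists."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        partial = list(pool.map(lambda s: s.search(query, k), servers))
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _, doc_id in merged]

# Hypothetical four-document collection split across two servers.
docs = {
    "d1": "parallel query processing",
    "d2": "parallel parallel index servers",
    "d3": "fault tolerance",
    "d4": "query logs",
}
results = broker_search(partition(docs, 2), "parallel query", 2)
```

Because every server scores its own documents independently, merging the local top-k lists yields the same top-k as a single server would under this scoring. Term partitioning would instead split the index by term, so a multi-term query would need postings from several servers.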