Blog for Class IS2140

Saturday, April 19, 2014

Week 14 Reading Notes: New fronts in information retrieval

Traditional adaptive filtering systems learn the user's interests in a rather simple way: words from relevant documents are favored in the query model, while words from irrelevant documents are down-weighted. This biases the query model towards specific words seen in the past, causing the system to favor documents containing relevant but redundant information over documents that use previously unseen words to denote new facts about the same news event. This paper proposes new ways of generalizing from relevance feedback by augmenting the traditional bag-of-words query model with named entity wildcards that are anchored in context. The use of wildcards allows generalization beyond specific words, while contextual restrictions limit the wildcard-matching to entities related to the user's query. We test our new approach in a nugget-level adaptive filtering system and evaluate it in terms of both relevance and novelty of the presented information. Our results indicate that higher recall is obtained when lexical terms are generalized using wildcards. However, such wildcards must be anchored to their context to maintain good precision. How the context of a wildcard is represented and matched against a given document also plays a crucial role in the performance of the retrieval system.

Week 13 Muddiest Points

No Muddiest Point for this week

Friday, April 11, 2014

Week 13 Reading Notes: Text classification and clustering

1. Text classification and Naïve Bayes:

Many users have ongoing information needs, such as tracking a topic over time. One way of meeting such a need is to issue the same query, for example multicore, against an index of recent newswire articles each morning. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time.
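A standing query can be sketched in a few lines. The class name `StandingQuery`, the in-memory `collection` dictionary, and the exact-token matching rule are all assumptions made for illustration; a real system would run against an index rather than scanning raw text.

```python
class StandingQuery:
    """A query that is re-run as the collection grows (hypothetical helper)."""

    def __init__(self, term):
        self.term = term.lower()
        self.seen = set()  # ids of documents already reported to the user

    def run(self, collection):
        """Return ids of not-yet-reported documents that contain the term."""
        hits = []
        for doc_id, text in collection.items():
            if doc_id in self.seen:
                continue
            if self.term in text.lower().split():
                hits.append(doc_id)
                self.seen.add(doc_id)
        return hits


# Day 1: the collection holds one article.
collection = {1: "Intel ships new multicore chips"}
query = StandingQuery("multicore")
first = query.run(collection)

# Day 2: new articles arrive and the same query is executed again.
collection[2] = "Stock markets close higher"
collection[3] = "Multicore designs dominate servers"
second = query.run(collection)
```

Each run reports only documents that were not returned before, which is what separates a standing query from simply repeating a search.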

2. Vector space classification

This chapter adopts a different representation for text classification: the vector space model. It represents each document as a vector with one real-valued component, usually a tf-idf weight, for each term.
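The representation can be illustrated with a small sketch. The weighting used here, raw term frequency times log10(N/df), is only one common tf-idf variant and is an assumption, as are the function names and the toy documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf-idf weight} dict."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log10(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "cheap flights to london".split(),
    "cheap hotels in london".split(),
    "naive bayes text classifier".split(),
]
vectors = tfidf_vectors(docs)
```

With these toy documents, the first two vectors share weighted terms and so have a positive cosine similarity, while the first and third share none.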

3. Flat clustering

(1) Clustering algorithms group a set of documents into subsets, or clusters. The algorithms’ goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible; and documents in one cluster should be as dissimilar as possible from documents in other clusters.

(2) Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.
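K-means is one widely used flat clustering algorithm; it is not named in the notes above, so treat the following as an illustrative sketch rather than the method the reading prescribes. The 2-D points, the choice of the first k points as seeds, and the fixed iteration count are assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Cluster points with K-means; the first k points serve as seeds."""
    centroids = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
assignment, centroids = kmeans(points, k=2)
```

The two steps pull in the directions described in (1): assignment keeps each document with its most similar cluster, and the update step recenters each cluster on its members.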

4. Hierarchical clustering: single-link, complete-link, group-average, and centroid similarity
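The four similarity criteria differ only in how they score a pair of clusters. The sketch below assumes documents are length-normalized vectors compared by dot product, and it takes group-average to mean the average over all pairs of distinct documents in the merged cluster; other texts define it slightly differently.

```python
from itertools import permutations, product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def single_link(a, b):
    """Similarity of the two most similar members, one from each cluster."""
    return max(dot(u, v) for u, v in product(a, b))


def complete_link(a, b):
    """Similarity of the two least similar members, one from each cluster."""
    return min(dot(u, v) for u, v in product(a, b))


def group_average(a, b):
    """Mean similarity over all pairs of distinct documents in the merged cluster."""
    pairs = list(permutations(a + b, 2))
    return sum(dot(u, v) for u, v in pairs) / len(pairs)


def centroid_similarity(a, b):
    """Similarity of the two cluster centroids."""
    def mean(cluster):
        return [sum(x) / len(cluster) for x in zip(*cluster)]
    return dot(mean(a), mean(b))


# Two toy clusters of unit-length vectors (assumed data).
a = [(1.0, 0.0), (0.8, 0.6)]
b = [(0.6, 0.8), (0.0, 1.0)]
```

An agglomerative algorithm would repeatedly merge the pair of clusters that scores highest under whichever criterion is chosen.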

Week 12 Muddiest Points

No Muddiest Point for this week

Friday, April 4, 2014

Week 12 Reading Notes: Intelligent information retrieval

1. User Profiles for Personalized Information Access

(1) User Profiling: gather and exploit information about individual users so that personalization can be effective.

(2) Explicit User Information Collection: user feedback, customization, navigation

(3) Implicit User Information Collection: browser caches, proxy servers, browser agents, desktop agents, and search logs.

2. Content-Based Recommendation Systems

These systems share three components: a means of describing the items that may be recommended, a means of creating a profile of the user that describes the types of items the user likes, and a means of comparing items to the user profile to determine what to recommend.

(1) Item Representation: Items that can be recommended to the user are often stored in a database table.

(2) User Profiles: A profile of the user’s interests

(3) Learning a User Model: Creating a model of the user’s preference from the user history

(4) Decision Trees and Rule Induction

(5) Nearest Neighbor Methods

(6) Relevance Feedback and Rocchio’s Algorithm

(7) Linear Classifiers

(8) Probabilistic Methods and Naïve Bayes
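As an illustration of item (6), Rocchio's algorithm moves the profile vector towards the liked items and away from the disliked ones. The sketch assumes sparse {term: weight} vectors; the parameter values alpha=1.0, beta=0.75, gamma=0.15 are commonly quoted defaults rather than values from the reading, and dropping negative weights is a common convention, not a requirement.

```python
from collections import defaultdict


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated {term: weight} profile from feedback documents."""
    updated = defaultdict(float)
    for term, w in query.items():
        updated[term] += alpha * w
    # Add the mean of the relevant document vectors.
    for doc in relevant:
        for term, w in doc.items():
            updated[term] += beta * w / len(relevant)
    # Subtract the mean of the non-relevant document vectors.
    for doc in nonrelevant:
        for term, w in doc.items():
            updated[term] -= gamma * w / len(nonrelevant)
    # Keep only terms whose weight stays positive.
    return {t: w for t, w in updated.items() if w > 0}


profile = rocchio(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}],
    nonrelevant=[{"jaguar": 1.0, "car": 1.0}],
)
```

In this toy case the profile gains the term "cat" from the liked item, while "car", which appears only in the disliked item, ends up with a negative weight and is dropped.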

Week 11 Muddiest Points

No muddiest point for this week.

Friday, March 28, 2014

Week 11 Reading Notes: Multilingual and parallel retrieval

Parallel Information Retrieval

Topic: Ways of making information retrieval systems scale to very large text collections, to overcome the limits of a single machine's computational power and storage capacity.

1. Parallel query processing

a) A search engine’s service rate is increased by having multiple index servers process incoming queries in parallel (document partitioning, term partitioning, hybrid schemes)

b) redundancy and fault tolerance issues in distributed search engines
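Document partitioning from a) can be sketched as follows: each index server holds a disjoint subset of the documents, a broker sends the query to every server, and the local top-k lists are merged. The function names, the term-count scoring, and the use of threads in place of separate machines are assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query_terms, k):
    """Score every document on one index server and return its local top k."""
    scored = []
    for doc_id, text in shard.items():
        tokens = text.lower().split()
        score = sum(tokens.count(t) for t in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def search(shards, query, k=2):
    """Broker: send the query to all shards in parallel, then merge results."""
    terms = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = pool.map(lambda s: search_shard(s, terms, k), shards)
        merged = [hit for hits in partial for hit in hits]
    return heapq.nlargest(k, merged)


# Two index servers, each holding its own subset of the collection.
shards = [
    {1: "parallel query processing", 2: "query logs"},
    {3: "parallel parallel index servers", 4: "fault tolerance"},
]
results = search(shards, "parallel query")  # list of (score, doc_id) pairs
```

Because every shard sees every query, a failed server means part of the collection goes unsearched, which is where the redundancy concerns in b) come in.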

2. MapReduce: parallel execution of off-line tasks
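The MapReduce programming model can be illustrated with a toy inverted-index build. This sketch runs sequentially in a single process, so it shows only the map, shuffle, and reduce phases, not the parallel execution a real framework provides; all function names are hypothetical.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_fn(term, doc_ids):
    """Reduce: turn all doc ids seen for a term into a sorted postings list."""
    return term, sorted(doc_ids)


def map_reduce(docs):
    # Shuffle: group the mapper output by key before reducing.
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


docs = {1: "map reduce", 2: "reduce tasks"}
index = map_reduce(docs)
```

Since each map call looks at one document and each reduce call at one term, both phases could be spread across many machines without changing these two functions.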