Saturday, April 19, 2014

Week 14 Reading Notes: New fronts in information retrieval

Traditional adaptive filtering systems learn the user's interests in a rather simple way: words from relevant documents are favored in the query model, while words from irrelevant documents are down-weighted. This biases the query model towards specific words seen in the past, causing the system to favor documents containing relevant but redundant information over documents that use previously unseen words to denote new facts about the same news event. This paper proposes new ways of generalizing from relevance feedback by augmenting the traditional bag-of-words query model with named entity wildcards that are anchored in context. The use of wildcards allows generalization beyond specific words, while contextual restrictions limit the wildcard-matching to entities related to the user's query. The authors test the new approach in a nugget-level adaptive filtering system and evaluate it in terms of both relevance and novelty of the presented information. Their results indicate that higher recall is obtained when lexical terms are generalized using wildcards; however, such wildcards must be anchored to their context to maintain good precision. How the context of a wildcard is represented and matched against a given document also plays a crucial role in the performance of the retrieval system.

Week 13 Muddiest Points

No Muddiest Point for this week

Friday, April 11, 2014

Week 13 Reading Notes: Text classification and clustering

1.       Text classification and Naïve Bayes:
Many users have ongoing information needs; for example, one way of tracking a topic is to issue a query such as multicore against an index of recent newswire articles each morning. A standing query is like any other query, except that it is periodically executed on a collection to which new documents are incrementally added over time.
2.       Vector space classification
The vector space model adopts a different representation for text classification: it represents each document as a vector with one real-valued component, usually a tf-idf weight, for each term.
3.       Flat clustering
(1)     Clustering algorithms group a set of documents into subsets, or clusters. The algorithms’ goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible, and documents in one cluster should be as dissimilar as possible from documents in other clusters.
(2)     Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.

4.       Hierarchical clustering: single-link, complete-link, group-average, and centroid similarity
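The single-link criterion in item 4 can be sketched as a naive agglomerative merge. This is a minimal sketch, assuming Euclidean distance and an illustrative threshold; real implementations use priority queues for efficiency.

```python
def single_link_clusters(points, threshold):
    """Merge clusters whose closest pair of members is within `threshold`."""
    clusters = [[p] for p in points]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: cluster distance = distance of the closest pair
                if min(dist(a, b) for a in clusters[i] for b in clusters[j]) <= threshold:
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters

result = single_link_clusters([(0, 0), (0, 1), (10, 10), (10, 11)], 2.0)
# two well-separated pairs end up in two clusters
```

Complete-link would replace `min` with `max`; group-average would average all pairwise distances.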

Week 12 Muddiest Points

No Muddiest Point for this week

Friday, April 4, 2014

Week 12 Reading Notes: Intelligent information retrieval

1. User Profiles for Personalized Information Access
(1) User Profiling: gathering and exploiting information about individuals so that personalized access can be effective.
(2) Explicit User Information Collection: user feedback, customization, navigation
(3) Implicit User Information Collection: browser caches, proxy servers, browser agents, desktop agents, and search logs.

2. Content-Based Recommendation Systems
A content-based recommender needs a means for describing the items that may be recommended, a means for creating a profile of the user that describes the types of items the user likes, and a means of comparing items to the user profile to determine what to recommend.
(1)   Item Representation: Items that can be recommended to the user are often stored in a database table.
(2)   User Profiles: A profile of the user’s interests
(3)   Learning a User Model: Creating a model of the user’s preference from the user history
(4)   Decision Trees and Rule Induction
(5)   Nearest Neighbor Methods
(6)   Relevance Feedback and Rocchio’s Algorithm
(7)   Linear Classifiers

(8)   Probabilistic Methods and Naïve Bayes

Week 11 Muddiest Points

No muddiest point for this week.

Friday, March 28, 2014

Week 11 Reading Notes: Multilingual and parallel retrieval

Parallel Information Retrieval
Topic: ways of making information retrieval systems scale to very large text collections, overcoming the limits of a single machine's computation power and storage capacity.
1.      Parallel query processing
a)        A search engine's service rate is increased by having multiple index servers process incoming queries in parallel (document partitioning, term partitioning, hybrid schemes)
b)        redundancy and fault tolerance issues in distributed search engines

2.      MapReduce: parallel execution of off-line tasks
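The MapReduce pattern in item 2 can be sketched on a single machine. This is a minimal sketch of the map/shuffle/reduce phases applied to index construction; the function names are illustrative, not a real framework API.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # map: emit (term, doc_id) pairs, as an indexing job would
    for term in text.lower().split():
        yield term, doc_id

def reduce_phase(pairs):
    # shuffle + reduce: group postings by term into sorted posting lists
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "parallel query processing", 2: "parallel index servers"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)
# index["parallel"] == [1, 2]
```

In a real deployment the map outputs are partitioned across many reducer machines, which is where the parallelism comes from.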

Week 10 Muddiest Points

As dynamic web programming is widely applied, more and more web pages cannot be parsed by traditional crawler. To retrieve information from those, what kind of technologies are used?

Friday, March 21, 2014

Week 10 Reading Notes: Web information retrieval

1. Web search basics
(1) Background and history about the forces that conspire to make the Web chaotic, fast-changing and very different from the “traditional” collections.

(2) Estimating the number of documents indexed by web search engines, and the elimination of duplicate documents in web indexes, respectively.

2. Link analysis
The use of hyperlinks for ranking web search results
(1)     The use of web graph
(2)     PageRank: the PageRank of a node depends on the link structure of the web graph. Given a query, a web search engine computes a composite score for each web page that combines hundreds of features, such as cosine similarity and term proximity, together with the PageRank score.

(3)     Hyperlink-Induced Topic Search (HITS)
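The PageRank idea in item (2) can be sketched as a small power iteration. This is a minimal sketch; the damping factor 0.85 is the conventional choice and the three-node graph is illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: (1 - damping) / n for u in nodes}
        for u, outs in links.items():
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share  # each out-link receives an equal share
            else:
                for v in nodes:  # dangling node: spread its rank uniformly
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
# "c" is linked to by both "a" and "b", so it ends up with the highest score
```

The iteration is the random-surfer model: with probability 0.85 follow a random out-link, otherwise teleport to a random page.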

Thursday, February 27, 2014

Week 8 Reading Notes: User interaction and visualization

1.       Effective human-computer interface
(1)     Design principles: offer informative feedback
(2)     Reduce working memory load
(3)     Provide alternative interfaces for novice and expert users.
Visualization, Evaluating Interactive Systems
2.       Information Access Process
3.       Starting Point
List of collections, overviews; examples, dialogs and wizards; automated source selection
4.       Query specification
5.       Context
Document surrogates, query term hits, SuperBook, categories, using hyperlinks to organize retrieval results, tables
6.       Relevance judgements
Interfaces, studies of user interaction, group relevance judgments, pseudo-relevance feedback
7.       Interface support for search process

String matching, window management, example systems

Week 7 Muddiest Points

No muddiest point for this week

Monday, February 17, 2014

Week 7 Reading Notes: Relevance feedback and query expansion

1.       The issue of synonymy: impact on the recall of most information retrieval systems.
Solution: query refinement, done automatically or with the user in the loop
(1)     Global methods include:
Query expansion/reformulation with a thesaurus or WordNet
Query expansion via automatic thesaurus generation
Techniques like spelling correction
(2)     Local methods adjust a query relative to the documents that initially appear to match the query. The basic methods here are:
• Relevance feedback
• Pseudo relevance feedback, also known as Blind relevance feedback

2.       Relevance feedback (RF): involve the user in the retrieval process so as to improve the final result set.
(1)     Basic procedure:
The user issues a (short, simple) query.
The system returns an initial set of retrieval results.
The user marks some returned documents as relevant or nonrelevant.
The system computes a better representation of the information need based on the user feedback.
The system displays a revised set of retrieval results.
(2)     The Rocchio algorithm

(3)     Relevance feedback is very effective at improving the relevance of results
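The Rocchio update can be sketched on toy term-frequency vectors. This is a minimal sketch; the weights alpha=1.0, beta=0.75, gamma=0.15 are conventional defaults, and the toy documents are illustrative.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    terms = set(query) | {t for d in relevant for t in d} | {t for d in nonrelevant for t in d}

    def centroid(docs, term):
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    new_query = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * centroid(relevant, t)
             - gamma * centroid(nonrelevant, t))
        new_query[t] = max(w, 0.0)  # negative weights are usually clipped to zero
    return new_query

q = rocchio({"jaguar": 1.0},
            relevant=[{"jaguar": 1.0, "safari": 1.0}],
            nonrelevant=[{"jaguar": 1.0, "car": 1.0}])
# "safari" gains weight from the relevant doc; "car" is pushed to zero
```

The update moves the query vector toward the centroid of relevant documents and away from the centroid of nonrelevant ones.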

Week 6 Muddiest Points

When measuring ranked retrieval, by computing the precision at fixed recall points it is possible to compare the performance of different IR systems. But since precision and recall are both very important evaluation measures, how can the two be combined to make a better judgement? Are there other methods that act like the harmonic mean?

Friday, February 14, 2014

Week 6 Reading Notes: Evaluation

Information retrieval has developed as a highly empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of novel techniques on representative document collections.

1.       Test collection:
A document collection;
A test suite of information needs, expressible as queries;
A set of relevance judgments, standardly a binary assessment of either relevant or nonrelevant for each query-document pair.
Standard test collections: Cranfield, Text Retrieval Conference (TREC)
2.       Evaluation of unranked retrieval sets
(1)     Precision (P) is the fraction of retrieved documents that are relevant
(2)     Recall (R) is the fraction of relevant documents that are retrieved
(3)     Accuracy is the fraction of its classifications that are correct
(4)     F measure is the weighted harmonic mean of precision and recall
3.       Evaluation of ranked retrieval sets
precision-recall curve, interpolated precision
4.       Assessing relevance
Pooling: where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems
marginal relevance: whether a document still has distinctive usefulness after the user has looked at certain other documents
5.       System quality and user utility
(1)     User utility: a way of quantifying aggregate user happiness, based on the relevance, speed, and user interface of a system

(2)     Refining a deployed system: A/B TEST
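The unranked measures in item 2 can be worked through on a toy result set (the document IDs are illustrative):

```python
def precision_recall_f1(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)            # relevant documents we retrieved
    precision = tp / len(retrieved)           # fraction of retrieved that are relevant
    recall = tp / len(relevant)               # fraction of relevant that are retrieved
    f1 = 2 * precision * recall / (precision + recall)  # balanced harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(retrieved=["d1", "d2", "d3", "d4"],
                               relevant=["d1", "d2", "d5"])
# p = 2/4 = 0.5, r = 2/3, f1 = 4/7 ≈ 0.571
```

Note that F1 sits below the arithmetic mean of P and R, which is exactly why the harmonic mean punishes lopsided systems.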

Week 5 Muddiest Points

What is the biggest difference between Language Model and Classic Probabilistic Model, since both of them heavily use probability?

Monday, February 3, 2014

Week 4 Muddiest Points

In vector space model, should log-frequency weighting be calculated before or after stop words removal? how about normalization?

Week 5 Reading Notes: Matching models- probabilistic and language model

Probabilistic information retrieval
The idea is to estimate the probability of a term t appearing in a relevant document, P(t|R = 1), and to use this as the basis of a classifier that decides whether documents are relevant or not.
1.    basic probability theory: prior probability, posterior probability, odds
2.    The Probability Ranking Principle: The 1/0 loss case
3.    Binary independence model
(1)  p_t = P(x_t = 1 | R = 1, q): the probability of a term appearing in a document relevant to the query; u_t = P(x_t = 1 | R = 0, q): the probability of a term appearing in a nonrelevant document
(2)  Determine a guess for the size of the relevant document set.
(3)  Improve our guesses for pt and ut.
(4)  Go to step 2 until the ranking of the returned results converges.
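As a sketch, the term weights this iteration refines can be computed with the common initial estimates p_t = 0.5 and u_t = df_t / N. The toy document frequencies below are illustrative.

```python
import math

def bim_score(query_terms, doc_terms, df, n_docs, p_t=0.5):
    """Sum the log odds-ratio weight c_t over query terms present in the doc."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            u_t = df.get(t, 0.5) / n_docs  # estimate u_t from document frequency
            # c_t = log(p_t(1-u_t) / (u_t(1-p_t))); with p_t = 0.5 this
            # reduces to log((N - df_t) / df_t), an idf-like weight
            score += math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))
    return score

df = {"rare": 1, "common": 500}
rare_w = bim_score({"rare"}, {"rare"}, df, n_docs=1000)      # ≈ log(999)
common_w = bim_score({"common"}, {"common"}, df, n_docs=1000)  # 0 with these estimates
```

The iteration in steps (2)-(4) then re-estimates p_t from the documents currently believed relevant and re-ranks until the ordering stabilizes.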

Language models
A document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often.

The basic language modeling approach builds a probabilistic language model Md from each document d, and ranks documents based on the probability of the model generating the query: P(q|Md).

1.    Language models:
Generative model
Language of an automaton: in formal language theory, the full set of strings that it can generate
Language model: a function that puts a probability measure over strings drawn from some vocabulary
Unigram language model: P_uni(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)

2.    The query likelihood model
(1)   Goal: rank documents by P(d|q)
(2)   Method: using Bayes Rule P(d|q) = P(q|d)P(d)/P(q)
→ P(q|d), the probability of the query q under the language model derived from d;
a query is treated as a random sample from the respective document model
(P(d) is assumed uniform and P(q) is the same for all documents, so both are usually ignored)
→ multinomial Naive Bayes model (page 263)
(3) estimate P(q|Md): count up how often each word occurred, and divide through by the total number of words in the document d.
(4) Optimization: smooth probabilities in the document language models to discount non-zero probabilities and give some probability mass to unseen words.
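The query-likelihood ranking above can be sketched with Jelinek-Mercer smoothing as one concrete smoothing choice: P(q|d) = Π_t [λ·P(t|Md) + (1−λ)·P(t|Mc)]. The mixture weight λ = 0.5 and the two toy documents are illustrative.

```python
def query_likelihood(query, doc, collection, lam=0.5):
    doc_tokens = doc.split()
    coll_tokens = " ".join(collection).split()
    prob = 1.0
    for t in query.split():
        p_doc = doc_tokens.count(t) / len(doc_tokens)    # P(t | Md), MLE in the doc
        p_coll = coll_tokens.count(t) / len(coll_tokens) # P(t | Mc), mass for unseen words
        prob *= lam * p_doc + (1 - lam) * p_coll
    return prob

docs = ["click go the shears boys click click click",
        "metal shears are on sale"]
scores = [query_likelihood("shears click", d, docs) for d in docs]
# the first document contains both query terms more often, so it scores higher
```

Without the collection model, any document missing a single query term would get probability zero; smoothing is what makes the ranking usable.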



Friday, January 31, 2014

Week 4 Reading Notes: Matching Models: Boolean and Vector Space

1.       Boolean retrieval model
(1)     Posting list intersection: merging; Query optimization
(2)     Extended Boolean model: term proximity; additional information needed: spelling-mistake tolerance, compounds or phrases, term frequency, rank (document score)

2.       Vector Space Model
(1)     Parametric and zone indexes: index and retrieve documents by metadata; simple means for scoring in response to a query.
(2)     Term frequency and weighting: based on the statistics of occurrence of the term.
(3)     The vector space model for scoring: viewing each document as a vector of such weights, we can compute a score between a query and each document

(4)     Variant tf-idf functions: several variants of term-weighting
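The posting-list intersection in item 1 can be sketched as the standard linear-time merge, assuming both lists are sorted by docID:

```python
def intersect(p1, p2):
    """Merge two docID-sorted posting lists for an AND query."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer sitting on the smaller docID
        else:
            j += 1
    return answer

intersect([1, 2, 4, 11, 31], [2, 31, 54])
# → [2, 31]
```

Query optimization orders the terms so that the rarest (shortest) lists are intersected first, keeping intermediate results small.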

Week 3 Muddiest Points

For search engines nowadays, such as Google and Baidu, what kinds of methods do they apply to construct their indexes?

Friday, January 24, 2014

Week 3 Reading Notes: Index Construction and Compression

       Index Construction

1.       Hardware basics: constraints
caching, seek time, buffer

2.       Blocked sort-based indexing (BSBI)           Θ(T log T)
(i) segments the collection into parts of equal size
(ii) sorts the termID–docID pairs of each part in memory
(iii) stores intermediate sorted results on disk
(iv)  merges all intermediate results into the final index.

3.       Single-pass in-memory indexing(SPIMI)             Θ(T)
SPIMI uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block.

4.       Distributed indexing
For web search engines; apply MapReduce.

5.       Dynamic indexing
For collections that are modified frequently with documents being added, deleted and updated; use auxiliary index.
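The SPIMI idea in item 3 can be sketched for a single in-memory block: postings accumulate directly under terms (no termID mapping), and terms are sorted only when the block is written out. The function name and toy tokens are illustrative.

```python
def spimi_invert(token_stream):
    """token_stream: iterable of (term, doc_id) pairs for one block."""
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        # the stream arrives in docID order, so appending keeps lists sorted
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # sort terms once per block, as SPIMI does before writing the block to disk
    return dict(sorted(dictionary.items()))

block = [("new", 1), ("york", 1), ("new", 2), ("delhi", 2)]
spimi_invert(block)
# → {"delhi": [2], "new": [1, 2], "york": [1]}
```

A full SPIMI run writes each block's index to disk when memory fills and merges the blocks at the end, which is why it avoids BSBI's global sort and runs in Θ(T).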

 Index Compression
1.       Benefits of compression
(1)     Less disk space
(2)     Increased use of caching
(3)     Faster transfer of data from disk to memory
2.       Statistical properties of terms
Heaps’ law: estimating the number of terms
Zipf’s law: modeling the distribution of terms
3.       Dictionary compression, in order to fit the dictionary in main memory and thus reduce disk seeks; store the dictionary as a single string (rather than an array of fixed-width entries), and compress it further with blocked storage

4.       Postings file compression: encode gaps with the minimum number of bytes or bits (bytewise and bitwise compression)
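The bytewise scheme can be sketched as variable-byte (VB) encoding of docID gaps: each gap is stored in 7-bit chunks, with the high bit marking the final byte of a gap. The postings list 824, 829, 215406 is the worked example from IIR.

```python
def vb_encode(gaps):
    out = []
    for g in gaps:
        chunk = [g % 128]
        g //= 128
        while g:                       # prepend higher-order 7-bit chunks
            chunk.insert(0, g % 128)
            g //= 128
        chunk[-1] += 128               # set continuation bit on the final byte
        out.extend(chunk)
    return bytes(out)

def vb_decode(data):
    gaps, n = [], 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte         # accumulate a partial number
        else:
            gaps.append(n * 128 + (byte - 128))
            n = 0
    return gaps

postings = [824, 829, 215406]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
# gaps == [824, 5, 214577]; vb_decode(vb_encode(gaps)) round-trips them
```

Small gaps (frequent terms) compress to a single byte, which is where the space savings come from; bitwise codes such as gamma codes squeeze further at the cost of decoding speed.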


Monday, January 13, 2014

Week 1 Reading Notes: Introduction

1. What is Information Retrieval and how did it develop
"Information retrieval deals with the representation, storage, organization of, and access to information items such as documents, Web pages, online catalogs, structured and semi-structured records, multimedia objects. The representation and organization of the information items should be such as to provide the users with easy access to information of their interest."
IR was first used in the field of libraries. Since the Web was invented in 1989, IR has been heavily used in web search. Other search applications include desktop and file-system search.

2. Architecture and  Process of IR System
The major components of an IR system are the information need, query, search engine, index, documents, and results. The system query is the parsed and expanded user query. Performance can be evaluated in terms of efficiency and effectiveness: efficiency can be measured in time and space, while effectiveness depends entirely on human judgments of relevance.
In processing, the ranking algorithm of the search engine is of great importance. The purpose of ranking is to identify the documents most likely to be considered relevant by the user, and it constitutes the most critical part of the IR system. One method of improving ranking is to collect feedback from users; on the web, the most abundant form of user feedback is clicks. Another way is to identify sites of high authority.

3. The Web
How Web change search:
Crawling: The inherent distributed nature of the Web collection requires collecting all documents and storing copies of them in a central repository, prior to indexing.
Larger size of collection; larger volume of user queries.
Harder for prediction accuracy.
Difficulty in identifying structured data associated with business objects (as in e-commerce).
Filtering web spam.

Week 2 Muddiest Points

Is it true that some pre-processing procedures are required for all IR, such as stop words removal, tokenization and normalization? Especially for tokenization, among the various tokenization approaches introduced in the class, how should we choose from them wisely to use in our Final Project?

Friday, January 10, 2014

Week 2 Reading Notes: Document and Query Processing

Important Concepts:
1. Bag-of-Words Representation
"A text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. "

2. Information Need
"Information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need."

3. Tokenization
"Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining."

4. Stemming
"A crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes."

My insights:
According to Chapter 2 of IIR, common terms known as stop words should be dropped, since they occur with very high frequency and may have little value in matching documents. However, later in the chapter it is mentioned that when processing two-word phrase queries with the combination scheme, a phrase index entry such as "The Who" is very helpful in improving search efficiency. As far as I am concerned, "The" might also be a stop word; if it were dropped as a stop word, the combination scheme would not work. Therefore, I think we should be more careful when choosing the stop list and consider the other strategies we might use in the IR system.

Wednesday, January 8, 2014

Week 1 Muddiest Points

1. About the recoverability of IR
Why is it downplayed? Does it mean that IR is more fault-tolerant than databases?

2. About null-existence search
The example given for a null-existence search was "a search to prove that no drug called 'claritin' (or with a similar name) exists".
Does it mean that this search will try to find all drugs with names similar to "claritin" and return yes if nothing is found?