Friday, January 31, 2014

Week 4 Reading Notes: Matching Models: Boolean and Vector Space

1.       Boolean retrieval model
(1)     Posting list intersection: merging; Query optimization
(2)     Extended Boolean model: term proximity; additional information needed: spelling-mistake tolerance, compound terms or phrases, term frequency, ranking (document scores)
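The intersection-by-merging step above can be sketched in Python; this is a toy in-memory version, assuming both posting lists are already sorted by docID:

```python
def intersect(p1, p2):
    """Merge two sorted posting lists of docIDs, keeping common entries."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID appears in both lists -> keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer at the smaller docID
            i += 1
        else:
            j += 1
    return answer

# e.g. intersect([1, 2, 4, 8], [2, 3, 8, 9]) -> [2, 8]
```

Query optimization then amounts to processing the shortest posting lists first, since the intermediate result can only shrink.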

2.       Vector Space Model
(1)     Parametric and zone indexes: index and retrieve documents by metadata; a simple means of scoring documents in response to a query.
(2)     Term frequency and weighting: based on statistics of the term's occurrences.
(3)     The vector space model for scoring: viewing each document as a vector of such weights, we can compute a score between a query and each document.

(4)     Variant tf-idf functions: several variants of term-weighting
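A minimal sketch of tf-idf weighting and cosine scoring over a toy collection (plain logarithmic tf and standard idf; the variant functions swap in different tf and idf components):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a tf-idf weight vector (dict term -> weight) for each document."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # document frequency of each term
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # weight = (1 + log10 tf) * log10(N / df)
        vectors.append({t: (1 + math.log10(tf[t])) * math.log10(N / df[t])
                        for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A query can be scored the same way: treat it as a short document, build its vector, and rank documents by cosine similarity to it.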

Week 3 Muddiest Points

For today's search engines, such as Google and Baidu, what kinds of methods do they apply to construct their indexes?

Friday, January 24, 2014

Week 3 Reading Notes: Index Construction and Compression

       Index Construction

1.       Hardware basics: constraints
caching, seek time, buffer

2.       Blocked sort-based indexing (BSBI)           Θ(T log T)
(i) segments the collection into parts of equal size
(ii) sorts the termID–docID pairs of each part in memory
(iii) stores intermediate sorted results on disk
(iv)  merges all intermediate results into the final index.
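The four steps can be sketched in Python; as an illustrative assumption, the on-disk intermediate runs are just in-memory lists here:

```python
import heapq

def bsbi(pairs, block_size):
    """Simplified BSBI over a list of (termID, docID) pairs."""
    runs = []
    for i in range(0, len(pairs), block_size):      # (i) segment into parts
        runs.append(sorted(pairs[i:i + block_size]))  # (ii) sort each part
    # (iii) would write each run to disk; here they stay in `runs`
    merged = heapq.merge(*runs)                     # (iv) k-way merge of runs
    index = {}                                      # termID -> sorted docIDs
    for term_id, doc_id in merged:
        postings = index.setdefault(term_id, [])
        if not postings or postings[-1] != doc_id:  # skip duplicate docIDs
            postings.append(doc_id)
    return index
```

The merge is where the real algorithm wins: each sorted run is read sequentially from disk, so only one buffer per run needs to fit in memory.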

3.       Single-pass in-memory indexing (SPIMI)             Θ(T)
SPIMI uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block.

4.       Distributed indexing
For web search engines; apply MapReduce.

5.       Dynamic indexing
For collections that are modified frequently with documents being added, deleted and updated; use auxiliary index.

 Index Compression
1.       Benefits of compression
(1)     Less disk space
(2)     Increased use of caching
(3)     Faster transfer of data from disk to memory
2.       Statistical properties of terms
Heaps’ law: Estimating the number of terms
Zipf’s law: Modeling the distribution of terms
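Both laws are simple enough to state as one-liners (the constants k = 44 and b = 0.49 below are typical illustrative values, not universal ones):

```python
def heaps_vocab(T, k=44, b=0.49):
    """Heaps' law: estimated vocabulary size M = k * T^b for T tokens."""
    return k * T ** b

def zipf_cf(i, c):
    """Zipf's law: collection frequency of the i-th most frequent term ~ c / i."""
    return c / i
```

So a million-token collection is predicted to have a vocabulary of only a few tens of thousands of terms, and the second most frequent term occurs about half as often as the first.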
3.       Dictionary compression, in order to fit the dictionary in main memory and thus reduce disk seeks; store the dictionary as a single string (rather than an array of fixed-width entries), and compress it further with blocked storage

4.       Postings file compression: encode gaps between docIDs with the minimum number of bytes or bits, using bytewise compression or bitwise compression
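The bytewise scheme (variable byte encoding) can be sketched as follows: each byte carries 7 bits of a gap, and the high bit is set only on the last byte of that gap:

```python
def vb_encode_number(n):
    """Variable byte encoding of one non-negative gap."""
    out = []
    while True:
        out.insert(0, n % 128)   # low 7 bits go into the current byte
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # set continuation-stop bit on the last byte
    return bytes(out)

def vb_encode_gaps(docids):
    """Encode a sorted posting list as VB-coded gaps between docIDs."""
    encoded = bytearray()
    prev = 0
    for d in docids:
        encoded += vb_encode_number(d - prev)
        prev = d
    return bytes(encoded)

# e.g. vb_encode_number(824) -> bytes([6, 184]), i.e. 00000110 10111000
```

Decoding just reverses this: accumulate 7-bit payloads until a byte with the high bit set ends the current gap.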


Monday, January 13, 2014

Week 1 Reading Notes: Introduction

1. What is Information Retrieval and how did it develop?
"Information retrieval deals with the representation, storage, organization of, and access to information items such as documents, Web pages, online catalogs, structured and semi-structured records, multimedia objects. The representation and organization of the information items should be such as to provide the users with easy access to information of their interest."
It was first used in the field of libraries. Since the Web was invented in 1989, IR has been heavily used in web search. Other search applications include desktop and file system search.

2. Architecture and Process of an IR System
Major components of an IR system are the information need, the query, the search engine, the index, the documents, and the results. The system query is the parsed and expanded user query. The system's performance can be evaluated in terms of efficiency and effectiveness: efficiency can be measured in time and space, while effectiveness depends entirely on human judgments of relevance.
When processing a query, the ranking algorithm of the search engine is of great importance. The purpose of ranking is to identify the documents most likely to be considered relevant by the user, and it constitutes the most critical part of the IR system. One method of improving ranking is collecting feedback from users; on the Web, the most abundant form of user feedback is clicks. Another way is to identify sites of high authority.

3. The Web
How the Web changed search:
Crawling: the inherently distributed nature of the Web collection requires collecting all documents and storing copies of them in a central repository prior to indexing.
Larger size of the collection; larger volume of user queries.
Harder to achieve prediction accuracy.
Difficulty in identifying structured data associated with business objects (such as e-commerce).
Filtering web spam.

Week 2 Muddiest Points

Is it true that some pre-processing procedures are required for all IR systems, such as stop word removal, tokenization, and normalization? For tokenization especially, among the various approaches introduced in class, how should we wisely choose which ones to use in our Final Project?

Friday, January 10, 2014

Week 2 Reading Notes: Document and Query Processing

Important Concepts:
1. Bag of Words Representation
"A text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. "

2. Information Need
"Information need is the topic about which the user desires to know more, and
QUERY is differentiated from a query, which is what the user conveys to the computer
in an attempt to communicate the information need."

3. Tokenization
"Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining."

4. Stemming
"A crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes."

My insights:
According to Chapter 2 of IIR, common terms known as stop words should be dropped, since they occur with very high frequency and may have little value in matching documents. However, later in the chapter, it is mentioned that when processing two-word phrase queries with the combination scheme, a phrase index for phrases such as "The Who" is very helpful in improving search efficiency. As far as I am concerned, "The" might also be a stop word; if it were dropped as a stop word, the combination scheme would not work. Therefore, I think we should be more careful when choosing the stop list, and consider the other strategies we might use in the IR system.

Wednesday, January 8, 2014

Week 1 Muddiest Points

1. About the recoverability of IR
Why is it downplayed? Does it mean that IR is more fault-tolerant than databases?

2. About Null existence search
The example given for null existence search was a search to prove that no drug with a name similar to "claritin" exists.
Does it mean that this search will try to find all the drugs that have names similar to "claritin" and return yes if nothing is found?