Blog for Class IS2140: Week 3 Reading Notes: Index Construction and Compression

Index Construction

1. Hardware basics: constraints

caching, seek time, buffer

2. Blocked sort-based indexing (BSBI) Θ(T log T)

(i) segments the collection into parts of equal size

(ii) sorts the termID–docID pairs of each part in memory

(iii) stores intermediate sorted results on disk

(iv) merges all intermediate results into the final index.

3. Single-pass in-memory indexing(SPIMI) Θ(T)

SPIMI uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block.

4. Distributed indexing

For web search engines; apply MapReduce.

5. Dynamic indexing

For collections that are modified frequently with documents being added, deleted and updated; use auxiliary index.

Index Compression

1. Benefits of compression

(1) Less disk space

(2) Increased use of caching

(3) Faster transfer of data from disk to memory

2. Statistical properties of terms

Heaps’ law: Estimating the number of terms；Zipf’s law: Modeling the distribution of terms

3. Dictionary compression, in order to fit the dictionary in main memory and thus reduce disk seeks; store dictionary as String(array of fixed-width), and further compress them by block storage

4. Posting file compression, encode gaps with minimum number of bytes and bits: bytewise compression and bit wise compression

Blog for Class IS2140

Friday, January 24, 2014

Week 3 Reading Notes: Index Construction and Compression

No comments:

Post a Comment