Friday, January 24, 2014

Week 3 Reading Notes: Index Construction and Compression

       Index Construction

1.       Hardware basics: constraints
caching, seek time, buffer

2.       Blocked sort-based indexing (BSBI)           Θ(T log T)
(i) segments the collection into parts of equal size
(ii) sorts the termID–docID pairs of each part in memory
(iii) stores intermediate sorted results on disk
(iv)  merges all intermediate results into the final index.

3.       Single-pass in-memory indexing(SPIMI)             Θ(T)
SPIMI uses terms instead of termIDs, writes each block’s dictionary to disk, and then starts a new dictionary for the next block.

4.       Distributed indexing
For web search engines; apply MapReduce.

5.       Dynamic indexing
For collections that are modified frequently with documents being added, deleted and updated; use auxiliary index.

 Index Compression
1.       Benefits of compression
(1)     Less disk space
(2)     Increased use of caching
(3)     Faster transfer of data from disk to memory
2.       Statistical properties of terms
Heaps’ law: Estimating the number of termsZipf’s law: Modeling the distribution of terms
3.       Dictionary compression, in order to fit the dictionary in main memory and thus reduce disk seeks; store dictionary as String(array of fixed-width), and further compress them by block storage

4.       Posting file compression, encode gaps with minimum number of bytes and bits: bytewise compression and bit wise compression


No comments:

Post a Comment