Index Construction
1.
Hardware basics: constraints
caching, seek time, buffer
2.
Blocked sort-based indexing (BSBI) Θ(T log T)
(i) segments the collection into
parts of equal size
(ii) sorts the termID–docID pairs of
each part in memory
(iii) stores intermediate sorted
results on disk
(iv) merges all intermediate results
into the final index.
3.
Single-pass in-memory indexing(SPIMI) Θ(T)
SPIMI uses terms instead of termIDs, writes each block’s
dictionary to disk, and then starts a new dictionary for the next block.
4.
Distributed indexing
For web search engines; apply MapReduce.
5.
Dynamic indexing
For collections that are modified frequently with documents
being added, deleted and updated; use auxiliary index.
Index Compression
Index Compression
1.
Benefits of compression
(1)
Less disk space
(2)
Increased use of caching
(3)
Faster transfer of data from
disk to memory
2.
Statistical properties of terms
Heaps’ law: Estimating the number of terms;Zipf’s law:
Modeling the distribution of terms
3.
Dictionary compression, in order
to fit the dictionary in main memory and thus reduce disk seeks; store
dictionary as String(array of fixed-width), and further compress them by block
storage
4.
Posting file compression,
encode gaps with minimum number of bytes and bits: bytewise compression and bit
wise compression
No comments:
Post a Comment