Friday, March 28, 2014

Week 11 Reading Notes: Multilingual and parallel retrieval

Parallel Information Retrieval
Topic: Ways of making information retrieval systems scale to very large text collections, working around the limits of a single machine's computing power and storage capacity.
1.      Parallel query processing
a)        A search engine’s service rate is increased by having multiple index servers process incoming queries in parallel (document partitioning, term partitioning, hybrid schemes)
b)        Redundancy and fault tolerance issues in distributed search engines

2.      MapReduce: parallel execution of off-line tasks
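To make the MapReduce idea concrete, here is a minimal single-process sketch of one typical off-line task, building an inverted index. It only imitates the map, shuffle, and reduce phases in plain Python; the function names and the tiny document set are assumptions for illustration, not the API of any real MapReduce framework.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per token in a document."""
    for term in text.lower().split():
        yield term, doc_id

def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would do
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(term, doc_ids):
    """Reduce: turn the doc ids for one term into a sorted,
    de-duplicated postings list."""
    return term, sorted(set(doc_ids))

def build_index(docs):
    # In a real cluster each map call could run on a different machine;
    # here everything runs sequentially in one process.
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())

# Hypothetical toy collection.
docs = {
    1: "parallel query processing",
    2: "parallel index construction",
    3: "query index",
}
index = build_index(docs)
```

Because each map call touches only one document and each reduce call touches only one term, both phases can be spread across many machines, which is what makes this style suitable for large off-line jobs.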

Week 10 Muddiest Points

As dynamic web programming becomes more widely used, more and more web pages cannot be parsed by a traditional crawler. What kinds of technologies are used to retrieve information from such pages?

Friday, March 21, 2014

Week 10 Reading Notes: Web information retrieval

1. Web search basics
(1) Background and history of the forces that conspire to make the Web chaotic, fast-changing and very different from “traditional” collections.

(2) Estimating the number of documents indexed by web search engines, and eliminating duplicate documents from web indexes.

2. Link analysis
The use of hyperlinks for ranking web search results.
(1)     The use of the web graph
(2)     PageRank: the PageRank of a node depends on the link structure of the web graph. Given a query, a web search engine computes a composite score for each web page that combines hundreds of features, such as cosine similarity and term proximity, together with the PageRank score.

(3)     Hyperlink-Induced Topic Search (HITS)
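As a rough illustration of how a PageRank score follows from link structure alone, here is a minimal power-iteration sketch. The damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the four-page graph are all assumptions chosen for the example. It also assumes every link target appears as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults, not tuned values.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution
    for _ in range(iterations):
        # Teleportation: every page receives a small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[node] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical web graph: C is linked to by A, B and D; nothing links to D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this toy graph the page with the most in-links (C) ends up with the highest score and the page with none (D) with the lowest. A search engine would then fold such a score into its composite ranking alongside query-dependent features.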