Important Concepts:
1. Bag of Words Representation
"A text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity."
2. Information Need
"An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need."
3. Tokenization
"Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining."
4. Stemming
"Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal [reducing inflected and derived forms of a word to a common base form] correctly most of the time, and often includes the removal of derivational affixes."
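To make concepts 1, 3, and 4 concrete, here is a minimal Python sketch that chains them together. It is only an illustration under simplifying assumptions: the regex tokenizer, the suffix list, and the function names (tokenize, crude_stem, bag_of_words) are my own hypothetical choices, and the suffix chopper is a toy stand-in, not the Porter stemmer described in IIR.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simple regex rule, not a full tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token, suffixes=("ing", "ed", "es", "s")):
    """Chop the first matching suffix off a token, keeping at least 3 characters.

    This is a toy heuristic for illustration only, not the Porter stemmer.
    """
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_words(text):
    """Count stemmed tokens: word order is discarded, multiplicity is kept."""
    return Counter(crude_stem(t) for t in tokenize(text))


bag = bag_of_words("The dog chased the dogs; the cats watched.")
print(bag)
```

In this example "dog" and "dogs" are merged into one term, while "chased" is cut down to "chas", which shows why the quote calls stemming a crude heuristic.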
My insights:
According to Chapter 2 of IIR, common terms, known as stop words, can be dropped because they occur very frequently and contribute little to matching documents. However, later in the chapter, the discussion of two-word phrase queries and combination schemes notes that indexing a phrase such as "The Who" is very helpful for improving search efficiency. "The" would normally be on a stop list, and if it is dropped, the phrase can no longer be indexed or matched as a phrase, so the combination scheme would not work for it. Therefore, I think we should choose the stop list carefully and take into account the other strategies the IR system uses, such as phrase indexing.
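The conflict can be shown with a small Python sketch. It assumes a plain biword index, which is only one ingredient of the combination schemes in the chapter, and the stop list, the two sample documents, and the function names are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop list, chosen only for this illustration.
STOP_WORDS = frozenset({"the", "a", "of", "to"})


def tokenize(text, stop_words=frozenset()):
    """Lowercase the text, split it into word tokens, and drop any stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]


def build_biword_index(docs, stop_words=frozenset()):
    """Map each pair of adjacent tokens to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text, stop_words)
        for first, second in zip(tokens, tokens[1:]):
            index[(first, second)].add(doc_id)
    return index


def phrase_query(index, phrase, stop_words=frozenset()):
    """Intersect the postings of every biword in the phrase.

    Returns None when fewer than two tokens survive, because the phrase
    then no longer forms a biword and this index cannot answer it.
    """
    tokens = tokenize(phrase, stop_words)
    if len(tokens) < 2:
        return None
    result = None
    for pair in zip(tokens, tokens[1:]):
        postings = index.get(pair, set())
        result = postings if result is None else result & postings
    return result


docs = {
    1: "The Who played a concert in London",
    2: "Who wrote the report",
}

full_index = build_biword_index(docs)
stopped_index = build_biword_index(docs, STOP_WORDS)

with_stop_words_kept = phrase_query(full_index, "The Who")
with_stop_words_dropped = phrase_query(stopped_index, "The Who", STOP_WORDS)
print(with_stop_words_kept, with_stop_words_dropped)
```

With the stop words kept, the phrase query finds only document 1. With "the" removed, the query shrinks to the single word "who", so the phrase index has nothing to look up and the system would have to fall back to an ordinary term search that also matches document 2.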