Traditional adaptive filtering systems learn the user's interests in a rather simple way: words from relevant documents are favored in the query model, while words from irrelevant documents are down-weighted. This biases the query model towards specific words seen in the past, causing the system to favor documents containing relevant but redundant information over documents that use previously unseen words to denote new facts about the same news event. This paper proposes new ways of generalizing from relevance feedback by augmenting the traditional bag-of-words query model with named entity wildcards that are anchored in context. The use of wildcards allows generalization beyond specific words, while contextual restrictions limit the wildcard-matching to entities related to the user's query. We test our new approach in a nugget-level adaptive filtering system and evaluate it in terms of both relevance and novelty of the presented information. Our results indicate that higher recall is obtained when lexical terms are generalized using wildcards. However, such wildcards must be anchored to their context to maintain good precision. How the context of a wildcard is represented and matched against a given document also plays a crucial role in the performance of the retrieval system.
Saturday, April 19, 2014
Friday, April 11, 2014
Week 13 Reading Notes: Text classification and clustering
1. Text classification and Naïve Bayes
Many users have ongoing information needs. One way of serving such a need is to issue a standing query (e.g., "multicore") against an index of recent newswire articles each morning. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time.
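Since the heading mentions Naïve Bayes, here is a minimal multinomial Naïve Bayes text classifier with add-one smoothing. The training data and function names are my own toy illustration, not code from the reading:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (class_label, list_of_tokens) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def classify_nb(tokens, class_counts, term_counts, vocab):
    n_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c in class_counts:
        # log prior + sum of log likelihoods with add-one smoothing
        score = math.log(class_counts[c] / n_docs)
        denom = sum(term_counts[c].values()) + len(vocab)
        for t in tokens:
            score += math.log((term_counts[c][t] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

model = train_nb([("tech", ["multicore", "chip"]), ("sport", ["match", "goal"])])
print(classify_nb(["multicore", "processor"], *model))  # -> "tech"
```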
2. Vector space classification
This adopts a different representation for text classification: the vector space model, which represents each document as a vector with one real-valued component, usually a tf-idf weight, for each term.
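A quick sketch of how classification works in this representation, using Rocchio (nearest-centroid) classification over sparse tf-idf dictionaries mapping term to weight (see the Week 4 notes further down this page for one way to compute such vectors). All names and data here are illustrative:

```python
import math
from collections import defaultdict

def centroid(vectors):
    # mean of a list of sparse term -> weight vectors
    c = defaultdict(float)
    for v in vectors:
        for term, w in v.items():
            c[term] += w / len(vectors)
    return dict(c)

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def classify(doc_vec, labelled):
    # labelled: class label -> list of tf-idf vectors for training docs
    cents = {c: centroid(vs) for c, vs in labelled.items()}
    return max(cents, key=lambda c: cosine(doc_vec, cents[c]))

labelled = {"china": [{"beijing": 1.0}, {"shanghai": 1.0, "beijing": 0.5}],
            "uk": [{"london": 1.0}, {"london": 0.8, "windsor": 0.6}]}
print(classify({"beijing": 0.7, "london": 0.1}, labelled))  # -> "china"
```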
3. Flat clustering
(1) Clustering algorithms group a set of documents into subsets, or clusters. The algorithms' goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible, and documents in one cluster should be as dissimilar as possible from documents in other clusters.
(2) Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
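A bare-bones K-means, the standard flat clustering algorithm. Real document clustering would run on tf-idf vectors, but plain 2-D points keep the sketch short:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # update step: recompute each center as the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return clusters, centers

clusters, _ = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(clusters)  # two internally coherent, well-separated clusters
```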
4. Hierarchical clustering: single-link, complete-link, group-average, and centroid similarity
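These four criteria map directly onto SciPy's linkage methods, so a usage sketch (assuming numpy and scipy are available; the points are made up) is:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # dendrogram as a merge table
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 flat clusters
    print(method, labels.tolist())
```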
Friday, April 4, 2014
Week 12 Reading Notes: Intelligent information retrieval
1. User Profiles for Personalized Information Access
(1) User profiling: to be effective, a personalized system must gather, and exploit, some information about individuals.
(2) Explicit user information collection: user feedback, customization, navigation
(3) Implicit user information collection: browser caches, proxy servers, browser agents, desktop agents, and search logs
2. Content-Based Recommendation Systems
These systems need some means of describing the items that may be recommended, a means for creating a profile of the user that describes the types of items the user likes, and a means of comparing items to the user profile to determine what to recommend (a toy sketch follows the list below).
(1) Item Representation: items that can be recommended to the user are often stored in a database table.
(2) User Profiles: a profile of the user's interests
(3) Learning a User Model: creating a model of the user's preferences from the user history
(4) Decision Trees and Rule Induction
(5) Nearest Neighbor Methods
(6) Relevance Feedback and Rocchio's Algorithm
(7) Linear Classifiers
(8) Probabilistic Methods and Naïve Bayes
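A toy sketch tying the three components together: item representation (here, attribute sets), a user profile learned from liked items, and a comparison step. Entirely illustrative:

```python
items = {
    "m1": {"sci-fi", "space"},
    "m2": {"romance", "comedy"},
    "m3": {"sci-fi", "robots"},
}

def profile(liked):
    # user model: union of the attributes of liked items
    return set().union(*(items[i] for i in liked))

def recommend(liked):
    p = profile(liked)
    # rank unseen items by Jaccard overlap with the profile
    scored = {i: len(a & p) / len(a | p)
              for i, a in items.items() if i not in liked}
    return max(scored, key=scored.get)

print(recommend({"m1"}))  # -> "m3", the other sci-fi item
```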
Friday, March 28, 2014
Week 11 Reading Notes: Multilingual and parallel retrieval
Parallel Information Retrieval
Topic: ways of making information retrieval systems scale to very large text collections, overcoming the limits of a single machine's computation power and storage capacity.
1. Parallel query processing
a) A search engine's service rate is increased by having multiple index servers process incoming queries in parallel (document partitioning, term partitioning, hybrid schemes).
b) Redundancy and fault-tolerance issues in distributed search engines
2. MapReduce: parallel execution of off-line tasks
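A single-process mock of the MapReduce flow for the canonical off-line task, building an inverted index. In a real deployment the map and reduce calls would run on many machines; this just shows the data flow:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id, text):
    # emit (term, docID) pairs for one document
    return [(term, doc_id) for term in text.split()]

def reduce_phase(term, doc_ids):
    # collect all docIDs for one term into a postings list
    return term, sorted(set(doc_ids))

docs = {1: "new home sales", 2: "home prices rise"}
pairs = sorted(p for d, t in docs.items() for p in map_phase(d, t))
index = dict(reduce_phase(term, [d for _, d in group])
             for term, group in groupby(pairs, key=itemgetter(0)))
print(index["home"])  # -> [1, 2]
```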
Week 10 Muddiest Points
As dynamic web programming becomes widely used, more and more web pages cannot be parsed by a traditional crawler. What kinds of technologies are used to retrieve information from such pages?
Friday, March 21, 2014
Week 10 Reading Notes: Web information retrieval
1. Web search basics
(1) Background and history: the forces that conspire to make the Web chaotic, fast-changing and very different from "traditional" collections.
(2) Estimating the number of documents indexed by web search engines, and eliminating duplicate documents in web indexes.
2. Link analysis
The use of hyperlinks for ranking web search results:
(1) The web graph
(2) PageRank: the PageRank of a node depends on the link structure of the web graph. Given a query, a web search engine computes a composite score for each web page that combines hundreds of features, such as cosine similarity and term proximity, together with the PageRank score.
(3) Hyperlink-Induced Topic Search (HITS)
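A power-iteration PageRank on a toy web graph, as a sketch of how link structure alone determines the score (the damping factor 0.85 is the usual choice; the graph is made up):

```python
def pagerank(graph, d=0.85, iters=50):
    n = len(graph)
    rank = {u: 1.0 / n for u in graph}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in graph}
        for u, outlinks in graph.items():
            if outlinks:
                share = d * rank[u] / len(outlinks)
                for v in outlinks:
                    new[v] += share
            else:
                # dangling node: spread its rank over the whole graph
                for v in new:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```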
Thursday, February 27, 2014
Week 8 Reading Notes: User interaction and visualization
1. Effective human-computer interfaces
(1) Design principles: offer informative feedback
(2) Reduce working-memory load
(3) Provide alternative interfaces for novice and expert users.
Also: visualization, evaluating interactive systems
2. The information access process
3. Starting points
Lists of collections, overviews; examples, dialogs and wizards; automated source selection
4. Query specification
5. Context
Document surrogates, query term hits, SuperBook, categories, using hyperlinks to organize retrieval results, tables
6. Relevance judgments
Interfaces, studies of user interaction, group relevance judgments, pseudo-relevance feedback
7. Interface support for the search process
String matching, window management, example systems
Monday, February 17, 2014
Week 7 Reading Notes: Relevance feedback and query expansion
1. The issue of synonymy: it hurts the recall of most information retrieval systems.
Solution: query refinement, either automatic or with the user in the loop.
(1) Global methods include:
• Query expansion/reformulation with a thesaurus or WordNet
• Query expansion via automatic thesaurus generation
• Techniques like spelling correction
(2) Local methods adjust a query relative to the documents that initially appear to match the query. The basic methods here are:
• Relevance feedback
• Pseudo relevance feedback, also known as blind relevance feedback
2. Relevance feedback (RF): involve the user in the retrieval process so as to improve the final result set.
(1) Basic procedure:
• The user issues a (short, simple) query.
• The system returns an initial set of retrieval results.
• The user marks some returned documents as relevant or nonrelevant.
• The system computes a better representation of the information need based on the user feedback.
• The system displays a revised set of retrieval results.
(2) The Rocchio algorithm (see the sketch below)
(3) RF is very effective at improving the relevance of results.
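A sketch of the Rocchio update from (2): the new query vector is alpha times the old query, plus beta times the centroid of the relevant documents, minus gamma times the centroid of the nonrelevant ones, with negative weights clipped to zero. Sparse dicts and toy numbers, purely illustrative:

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for docs, coef in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for t, w in doc.items():
                new_q[t] += coef * w / max(len(docs), 1)
    # by convention, drop terms whose weight goes negative
    return {t: w for t, w in new_q.items() if w > 0}

q = rocchio({"jaguar": 1.0},
            relevant=[{"jaguar": 1.0, "car": 2.0}],
            nonrelevant=[{"jaguar": 1.0, "cat": 2.0}])
print(q)  # "car" gains weight; "cat" is clipped away
```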
Week 6 Muddiest Points
When measuring ranked retrieval, computing the precision at fixed recall points makes it possible to compare the performance of different IR systems. But since both precision and recall are very important evaluation measures, how can we combine the two to make a better judgment? Are there other methods that act like the harmonic mean?
Friday, February 14, 2014
Week 6 Reading Notes: Evaluation
Information retrieval has developed as a highly empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of novel techniques on representative document collections.
1. Test collections:
A document collection;
A test suite of information needs, expressible as queries;
A set of relevance judgments, standardly a binary assessment of either relevant or nonrelevant for each query-document pair.
Standard test collections: Cranfield, Text REtrieval Conference (TREC)
2. Evaluation of unranked retrieval sets
(1) Precision (P) is the fraction of retrieved documents that are relevant.
(2) Recall (R) is the fraction of relevant documents that are retrieved.
(3) Accuracy is the fraction of classifications that are correct.
(4) The F measure is the weighted harmonic mean of precision and recall.
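These definitions fit in a few lines. Here `retrieved` and `relevant` are sets of document ids, and F1 is the balanced harmonic mean:

```python
def prf(retrieved, relevant):
    tp = len(retrieved & relevant)               # true positives
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

print(prf({1, 2, 3, 4}, {2, 4, 5}))  # -> (0.5, 0.666..., 0.571...)
```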
3. Evaluation of ranked retrieval sets: precision-recall curve, interpolated precision
4. Assessing relevance
Pooling: relevance is assessed over a subset of the collection, formed from the top k documents returned by a number of different IR systems.
Marginal relevance: whether a document still has distinctive usefulness after the user has looked at certain other documents.
5. System quality and user utility
(1) User utility: a way of quantifying aggregate user happiness, based on the relevance, speed, and user interface of a system.
(2) Refining a deployed system: A/B testing
Week 5 Muddiest Points
What is the biggest difference between the language model and the classic probabilistic model, since both of them make heavy use of probability?
Monday, February 3, 2014
Week 4 Muddiest Points
In the vector space model, should log-frequency weighting be calculated before or after stop-word removal? How about normalization?
Week 5 Reading Notes: Matching models- probabilistic and language model
Probabilistic information retrieval
The idea is to estimate the probability of a term t appearing in a relevant document, P(t|R = 1), and to use this as the basis of a classifier that decides whether documents are relevant or not.
1. Basic probability theory: prior probability, posterior probability, odds
2. The Probability Ranking Principle: the 1/0 loss case
3. Binary independence model
(1) pt = P(xt = 1 | R = 1, q): the probability of a term appearing in a document relevant to the query; ut = P(xt = 1 | R = 0, q): the probability of a term appearing in a nonrelevant document.
(2) Determine a guess for the size of the relevant document set.
(3) Improve our guesses for pt and ut.
(4) Go to step 2 until the ranking of the returned results converges.
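In this model the retrieval status value of a document is a sum of term weights ct = log[pt(1 - ut) / (ut(1 - pt))] over the query terms it contains. A sketch with the standard add-0.5 smoothed estimates (my variable names, following the notation above):

```python
import math

def rsj_weight(df_t, n_docs, s_t=0, s=0):
    # s documents judged relevant, s_t of them containing term t;
    # add-0.5 smoothing as in the standard derivation
    p_t = (s_t + 0.5) / (s + 1.0)
    u_t = (df_t - s_t + 0.5) / (n_docs - s + 1.0)
    return math.log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

# with no relevance information this reduces to an idf-like weight
print(rsj_weight(df_t=100, n_docs=100000))
```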
Language models
A document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often.
The basic language modeling approach builds a probabilistic language model Md from each document d, and ranks documents based on the probability of the model generating the query: P(q|Md).
1. Language models:
Generative models
Language of an automaton: the full set of strings it can generate (from formal language theory)
Language model: a function that puts a probability measure over strings drawn from some vocabulary
Unigram language model: Puni(t1 t2 t3 t4) = P(t1)P(t2)P(t3)P(t4)
2. The query likelihood model
(1) Goal: rank documents by P(d|q)
(2) Method: by Bayes' rule, P(d|q) = P(q|d)P(d)/P(q)
→ P(q|d) is the probability of the query q under the language model derived from d; the query is treated as a random sample from the respective document model.
(P(d) and P(q) are uniform and therefore usually ignored.)
→ this is a multinomial Naive Bayes model (page 263)
(3) Estimate P(q|Md): count up how often each word occurs, and divide by the total number of words in the document d.
(4) Optimization: smooth the probabilities in the document language models, to discount non-zero probabilities and give some probability mass to unseen words.
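Putting (3) and (4) together, here is a sketch of query likelihood scoring with Jelinek-Mercer smoothing, which mixes the document model with a collection model. Toy data, and it assumes every query term occurs somewhere in the collection:

```python
import math
from collections import Counter

def score(query, doc, collection, lam=0.5):
    d, c = Counter(doc), Counter(collection)
    len_d, len_c = len(doc), len(collection)
    logp = 0.0
    for t in query:
        p_d = d[t] / len_d   # maximum-likelihood document model
        p_c = c[t] / len_c   # collection model covers unseen words
        logp += math.log(lam * p_d + (1 - lam) * p_c)
    return logp

docs = [["click", "go", "the", "shears", "boys"], ["metal", "shears", "click"]]
collection = [t for d in docs for t in d]
for d in docs:
    print(d, score(["shears", "click"], d, collection))
```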
Friday, January 31, 2014
Week 4 Reading Notes: Matching Models: Boolean and Vector Space
1. Boolean retrieval model
(1) Postings list intersection: merging; query optimization
(2) Extended Boolean model: term proximity; additional information needed for spelling-mistake tolerance, compounds or phrases, term frequency, and ranking (document scores)
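The merge-based intersection of two sorted postings lists, the step behind Boolean AND, looks like this (textbook-style sketch):

```python
def intersect(p1, p2):
    # linear walk over two sorted docID lists
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31], [2, 31, 54]))  # -> [2, 31]
```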
2. Vector Space Model
(1) Parametric and zone indexes: index and retrieve documents by metadata; a simple means for scoring documents in response to a query.
(2) Term frequency and weighting: based on the statistics of occurrence of the term.
(3) The vector space model for scoring: viewing each document as a vector of such weights, we can compute a score between a query and each document.
(4) Variant tf-idf functions: several variants of term weighting
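A compact sketch of ranked retrieval with log-frequency tf-idf weighting and cosine scoring, covering (2) through (4); other tf-idf variants would slot into the marked line. Toy corpus, illustrative only:

```python
import math
from collections import Counter

def rank(query, docs):
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    def vec(tokens):
        # tf-idf variant goes here: (1 + log tf) * log(N/df)
        return {t: (1 + math.log10(tf)) * math.log10(n / df[t])
                for t, tf in Counter(tokens).items() if df[t]}
    def norm(v):
        return math.sqrt(sum(w * w for w in v.values())) or 1.0
    q = vec(query)
    scores = []
    for i, d in enumerate(docs):
        v = vec(d)
        dot = sum(w * v.get(t, 0.0) for t, w in q.items())
        scores.append((dot / (norm(q) * norm(v)), i))
    return sorted(scores, reverse=True)

docs = [["car", "insurance", "auto"], ["best", "car", "rates"], ["auto", "repair"]]
print(rank(["car", "insurance"], docs))
```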
Week 3 Muddiest Points
For search engines nowadays, such as Google and Baidu, what kinds of methods do they apply to construct their indexes?
Friday, January 24, 2014
Week 3 Reading Notes: Index Construction and Compression
Index Construction
1. Hardware basics: constraints such as caching, seek time, and buffers
2. Blocked sort-based indexing (BSBI), Θ(T log T):
(i) segments the collection into parts of equal size,
(ii) sorts the termID–docID pairs of each part in memory,
(iii) stores intermediate sorted results on disk,
(iv) merges all intermediate results into the final index.
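A single-machine mock of those four steps, with pickled temp files standing in for the on-disk runs; illustrative only:

```python
import heapq, itertools, os, pickle, tempfile

def bsbi(pairs, block_size):
    runs = []
    it = iter(pairs)
    while True:
        block = sorted(itertools.islice(it, block_size))  # (ii) sort one block
        if not block:
            break
        f = tempfile.NamedTemporaryFile(delete=False)     # (iii) spill to disk
        pickle.dump(block, f)
        f.close()
        runs.append(f.name)
    def read(name):
        with open(name, "rb") as f:
            yield from pickle.load(f)
    merged = heapq.merge(*(read(name) for name in runs))  # (iv) k-way merge
    index = {}
    for term_id, doc_id in merged:
        index.setdefault(term_id, []).append(doc_id)
    for name in runs:
        os.unlink(name)
    return index

print(bsbi([(2, 1), (1, 1), (1, 2), (3, 2), (2, 3)], block_size=2))
```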
3. Single-pass in-memory indexing (SPIMI), Θ(T)
SPIMI uses terms instead of termIDs, writes each block's dictionary to disk, and then starts a new dictionary for the next block.
4. Distributed indexing
For web search engines; applies MapReduce.
5. Dynamic indexing
For collections that are modified frequently, with documents being added, deleted and updated; uses an auxiliary index.
Index Compression
1. Benefits of compression
(1) Less disk space
(2) Increased use of caching
(3) Faster transfer of data from disk to memory
2. Statistical properties of terms
Heaps' law estimates the number of terms; Zipf's law models the distribution of terms.
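Heaps' law has the form M = k·T^b, estimating vocabulary size M from the number of tokens T. Plugging in the constants fitted to Reuters-RCV1 in IIR (k ≈ 44, b ≈ 0.49; treat them as assumptions for any other collection) gives a quick back-of-the-envelope:

```python
k, b = 44, 0.49  # Heaps' law constants fitted to RCV1 in IIR
for T in (10**6, 10**8):
    print(f"{T:>12,} tokens -> ~{int(k * T**b):,} distinct terms")
```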
3. Dictionary compression, in order to fit the dictionary in main memory and thus reduce disk seeks: store the dictionary as one long string (instead of an array of fixed-width entries), and compress it further with blocked storage.
4. Postings file compression: encode gaps with the minimum number of bytes or bits, i.e., bytewise compression and bitwise compression.
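A sketch of the bytewise scheme, variable byte (VByte) coding: each gap is split into 7-bit chunks, and the high bit marks the last byte of a gap:

```python
def vb_encode(gaps):
    out = bytearray()
    for g in gaps:
        chunk = [g % 128]
        g //= 128
        while g:
            chunk.append(g % 128)
            g //= 128
        chunk[0] += 128              # continuation bit on the final byte
        out.extend(reversed(chunk))
    return bytes(out)

def vb_decode(data):
    gaps, g = [], 0
    for byte in data:
        if byte < 128:
            g = g * 128 + byte       # accumulate 7-bit chunks
        else:
            gaps.append(g * 128 + (byte - 128))
            g = 0
    return gaps

postings = [824, 829, 215406]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
print(vb_decode(vb_encode(gaps)))  # -> [824, 5, 214577]
```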
Monday, January 13, 2014
Week 1 Reading Notes: Introduction
1. What is Information Retrieval and how it developed
"Information retrieval deals with the representation, storage, organization of, and access to information items such as documents, Web pages, online catalogs, structured and semi-structured records, multimedia objects. The representation and organization of the information items should be such as to provide the users with easy access to information of their interest."
IR was first used in the field of libraries. Since the Web was invented in 1989, IR has been heavily used in web search. Other search applications include desktop and file-system search.
2. Architecture and Process of IR System
The major components of an IR system are the information need, query, search engine, index, documents and results. The system query is the parsed and expanded user query. An IR system's performance can be evaluated in terms of efficiency and effectiveness: efficiency can be measured in terms of time and space, while effectiveness depends entirely on human judgments of relevance.
During processing, the ranking algorithm of the search engine is of great importance. The purpose of ranking is to identify the documents that are most likely to be considered relevant by the user, and it constitutes the most critical part of the IR system. One method of improving ranking is collecting feedback from users; on the Web, the most abundant form of user feedback is clicks. Another way is to identify sites of high authority.
3. The Web
How the Web changed search:
Crawling: the inherent distributed nature of the Web collection requires collecting all documents and storing copies of them in a central repository, prior to indexing.
Larger collection size; larger volume of user queries.
Harder to achieve good prediction accuracy.
Difficulty in identifying structured data associated with business objects (such as e-commerce).
Filtering web spam.
"Information retrieval deals with the representation, storage, organization of, and access to information items such as documents, Web pages, online catalogs, structured and semi-structured records, multimedia objects. The representation and organization of the information items should be such as to provide the users with easy access to information of their interest."
It is first used in the field of Libraries. After Web was invented in 1989, IR is heavily used in web search since then. Other search applications including desktop and file system search.
2. Architecture and Process of IR System
Major components of IR System are information need, query, search engine, index, document and results. System query is the parsed and expanded user query.Its performance could be evaluated based on efficiency and effectiveness. Efficiency can be measured in terms of time and space; while effectiveness depends entirely on human judgement of relevance.
When processing, the ranking algorithm of search engine is of great importance. The purpose of ranking is to identify the documents that are mostly likely to be considered relevant by the user, and constitutes the most critical part of the IR system. One method of improve ranking is collecting feedback from the users. In the web the most abundant form of user feedback are the clicks. Another way is to identify sites of high authority.
3. The Web
How Web change search:
Crawling: The inherent distributed nature of the Web collection requires collecting all documents and storing copies of them in a central repository, prior to indexing.
Larger size of collection; larger volume of user queries.
Harder for prediction accuracy.
Difficulty in identifying structured data associated with business object(such as e-commerce).
Filtering web spam.
Week 2 Muddiest Points
Is it true that some pre-processing procedures, such as stop-word removal, tokenization and normalization, are required for all IR systems? Especially for tokenization: among the various tokenization approaches introduced in class, how should we choose which one to use in our Final Project?
Friday, January 10, 2014
Week 2 Reading Notes: Document and Query Processing
Important Concepts:
1. Bag-of-Words Representation
"A text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. "
2. Information Need
"The information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need."
3. Tokenization
"Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining."
4. Stemming
"A crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes."
My insights:
According to Chapter 2 of IIR, common terms known as stop words should be dropped, since they occur at very high frequency and may have little value in matching documents. However, later in the chapter it is mentioned that when processing two-word phrase queries with the combination scheme, a phrase index for phrases such as "The Who" is very helpful for improving search efficiency. As far as I can tell, "The" would normally be a stop word, and if it is dropped the combination scheme would not work. Therefore, I think we should be more careful when choosing the stop list, and consider the other strategies we might use in the IR system.
Wednesday, January 8, 2014
Week 1 Muddiest Points
1. About the recoverability of IR
Why is it downplayed? Does it mean that IR is more fault-tolerant than databases?
2. About null-existence search
The example given for a null-existence search was "a search to prove that no drug called 'claritin', or with a similar name, exists."
Does it mean that this search will try to find all the drugs that have names similar to "claritin" and return yes if nothing is found?