Information retrieval dynamic indexing

Information Retrieval
CSE 840

Presenter
NADIA NAHAR
BSSE 0327

Why Dynamic Indexing?
• Collections are not static
• Documents come in over time and need to
be inserted
• Documents are often deleted and modified
• So the dictionary and postings lists need to
be modified:
– Postings updates for terms already in
dictionary
– New terms added to dictionary

Simplest approach
• Maintain a “big” main index
• New docs go into a “small” auxiliary index
• Search across both, merge results
• Invalidation bit-vector for deleted docs
– Filter the docs returned by a search through this
invalidation bit-vector
• Documents are updated by deleting and
reinserting them
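
The scheme above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the class name `DynamicIndex`, its methods, and the use of a Python `set` in place of a real bit-vector are all assumptions made for the example, and the merge of the auxiliary index into the main index is left out.

```python
from collections import defaultdict


class DynamicIndex:
    """Toy main + auxiliary inverted index (hypothetical names throughout)."""

    def __init__(self, main_postings):
        # "Big" main index: term -> sorted list of doc IDs.
        self.main = {t: list(p) for t, p in main_postings.items()}
        # "Small" auxiliary index that receives all new documents.
        self.aux = defaultdict(list)
        # A set stands in for the invalidation bit-vector (assumption).
        self.deleted = set()
        # New doc IDs continue after the largest ID in the main index.
        self.next_id = 1 + max(
            (d for p in self.main.values() for d in p), default=-1
        )

    def add(self, text):
        doc_id = self.next_id
        self.next_id += 1
        for term in set(text.lower().split()):
            self.aux[term].append(doc_id)
        return doc_id

    def delete(self, doc_id):
        # Postings stay where they are; the doc is only marked invalid.
        self.deleted.add(doc_id)

    def update(self, doc_id, text):
        # An update is a delete followed by a reinsert under a new ID.
        self.delete(doc_id)
        return self.add(text)

    def search(self, term):
        # Search both indexes, merge, then filter out invalidated docs.
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in hits if d not in self.deleted]


index = DynamicIndex({"cat": [0, 2], "dog": [1]})
new_id = index.add("cat sat")        # goes into the auxiliary index
index.delete(2)                      # only flips the invalidation entry
updated_id = index.update(0, "dog barks")
print(index.search("cat"))           # [3]
print(index.search("dog"))           # [1, 4]
```

Because auxiliary doc IDs are always larger than main doc IDs in this sketch, concatenating the two postings lists keeps the result sorted.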

[Figure: two search results are merged into one result list]

Issues with main and auxiliary indexes
• Problem of frequent merges – the same postings are rewritten again and again
• Poor performance during merge
• Actually:
– Merging of the auxiliary index into the main index is efficient if we
keep a separate file for each postings list.
– Merge is the same as a simple append.
– But then we would need a lot of files – inefficient for OS.
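
To make the "merge is a simple append" point concrete, here is a small sketch under stated assumptions: one text file per term, one doc ID per line, and file names derived directly from the term. The helper names `postings_path`, `append_merge`, and `read_postings` are hypothetical.

```python
import tempfile
from pathlib import Path


def postings_path(index_dir, term):
    # One file per postings list (assumed naming scheme).
    return Path(index_dir) / f"{term}.postings"


def append_merge(index_dir, aux):
    """Merge an auxiliary index (term -> doc IDs) by appending to each file."""
    for term, postings in aux.items():
        with open(postings_path(index_dir, term), "a") as f:
            f.write("".join(f"{doc_id}\n" for doc_id in postings))


def read_postings(index_dir, term):
    path = postings_path(index_dir, term)
    if not path.exists():
        return []
    return [int(line) for line in path.read_text().split()]


demo_dir = tempfile.mkdtemp()
append_merge(demo_dir, {"cat": [1, 2]})             # first merge creates the file
append_merge(demo_dir, {"cat": [5], "dog": [3]})    # later merges only append
print(read_postings(demo_dir, "cat"))               # [1, 2, 5]
```

The existing postings are never rewritten, but the directory ends up with one file per distinct term, which is the cost the slide points out.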

Logarithmic merge
• Maintain a series of indexes, each twice as
large as the previous one
– At any time, some of these powers of 2 are
instantiated
• Keep smallest (Z0) in memory
• Larger ones (I0, I1, …) on disk
• If Z0 gets too big (> n), write it to disk as I0
– or, if I0 already exists, merge Z0 with I0 to form Z1
• Then either write Z1 to disk as I1 (if there is no I1)
– or merge Z1 with I1 to form Z2, and so on
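
The cascade described above can be sketched as follows. This is an illustrative toy, not a reference implementation: the class `LogMergeIndex` and the function `merge` are hypothetical names, a Python dict stands in for each on-disk index, and Z0 is flushed as soon as it holds n postings (a simplification of the "> n" condition).

```python
def merge(a, b):
    """Merge two indexes (term -> doc IDs) into a new index."""
    out = {t: list(p) for t, p in a.items()}
    for term, postings in b.items():
        out[term] = sorted(set(out.get(term, [])) | set(postings))
    return out


class LogMergeIndex:
    def __init__(self, n):
        self.n = n          # capacity of Z0, counted in postings
        self.z0 = {}        # in-memory index Z0
        self.z0_size = 0
        self.disk = {}      # level i -> index I_i (dict stands in for a file)

    def add_posting(self, term, doc_id):
        self.z0.setdefault(term, []).append(doc_id)
        self.z0_size += 1
        if self.z0_size >= self.n:
            self._flush()

    def _flush(self):
        z, i = self.z0, 0
        while i in self.disk:
            # I_i already exists: merge it with Z_i to form Z_{i+1}.
            z = merge(self.disk.pop(i), z)
            i += 1
        # Level i is free: write Z_i to "disk" as I_i.
        self.disk[i] = z
        self.z0, self.z0_size = {}, 0

    def search(self, term):
        # A query has to consult Z0 and every instantiated I_i.
        hits = set(self.z0.get(term, []))
        for index in self.disk.values():
            hits.update(index.get(term, []))
        return sorted(hits)


idx = LogMergeIndex(n=2)
for doc_id in range(6):
    idx.add_posting("t", doc_id)
levels_after_6 = sorted(idx.disk)    # [0, 1]
for doc_id in range(6, 8):
    idx.add_posting("t", doc_id)
levels_after_8 = sorted(idx.disk)    # [2]
```

The set of instantiated levels behaves like a binary counter of flushes: three flushes leave I0 and I1 on disk, and the fourth merges everything into I2.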

Logarithmic merge
• Auxiliary and main index: index construction
time is O(T^2), where T is the total number of
postings, as each posting is touched in each merge.
• Logarithmic merge: Each posting is merged
O(log T) times, so complexity is O(T log T)
• So logarithmic merge is much more efficient
for index construction
• But query processing now requires the
merging of O(log T) indexes
– Whereas it is O(1) if you just have a main and
auxiliary index
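
A rough way to see the gap between the two bounds is to count how many postings get written in each scheme. The sketch below is a simplified cost model under stated assumptions: it counts only postings written to disk, it flushes every n postings, and the function names are hypothetical.

```python
def touches_main_aux(total, n):
    """Postings written when each full auxiliary index is merged into main."""
    touched, main = 0, 0
    for _ in range(total // n):
        main += n           # the whole main index is rewritten on every merge
        touched += main
    return touched


def touches_logarithmic(total, n):
    """Postings written when full in-memory indexes are merged logarithmically."""
    touched, levels = 0, {}     # level i -> size of I_i in postings
    for _ in range(total // n):
        size, i = n, 0
        while i in levels:      # cascade upward while the level is occupied
            size += levels.pop(i)
            i += 1
        levels[i] = size
        touched += size         # only the final merged index is written
    return touched


print(touches_main_aux(1024, 1))      # 524800
print(touches_logarithmic(1024, 1))   # 6144
```

With T = 1024 and n = 1, the first count grows like T^2 / 2 while the second stays below T log2 T = 10240, which matches the asymptotic claims on the slide.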

Further issues with multiple indexes
• Collection-wide statistics are hard to
maintain
• E.g., spell-correction: which of several
corrected alternatives do we present to the
user?
– pick the one with the most hits
• How do we maintain the top ones with
multiple indexes and invalidation bit vectors?
– One possibility: ignore everything but the main
index for such ordering

Dynamic indexing at search engines
• All the large search engines now do dynamic
indexing
• Their indexes have frequent incremental
changes
– News items, blogs, new topical web pages
• But (sometimes/typically) they also
periodically reconstruct the index from
scratch
– Query processing is then switched to the new
index, and the old index is deleted