Information Retrieval 
CSE 840
Presenter 
NADIA NAHAR 
BSSE 0327 
Topic 
DYNAMIC INDEXING 
Why Dynamic Indexing?
• Collections are not static 
• Documents come in over time and need to 
be inserted 
• Documents are often deleted and modified 
• So the dictionary and postings lists need to 
be modified: 
– Postings updates for terms already in 
dictionary 
– New terms added to dictionary 
Simplest approach 
• Maintain a “big” main index
• New docs go into a “small” auxiliary index
• Search across both, merge results
• Deletions: keep an invalidation bit-vector for deleted docs
– Filter the docs returned by a search through this
invalidation bit-vector
• Updates: documents are updated by deleting and
reinserting them
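
The scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the class name `DynamicIndex` and its methods are hypothetical, both indexes are plain in-memory dicts, and a Python set stands in for the invalidation bit-vector.

```python
class DynamicIndex:
    """Main + auxiliary inverted index with deletions (hypothetical sketch)."""

    def __init__(self, main=None):
        self.main = main or {}  # term -> sorted doc IDs (the "big" index)
        self.aux = {}           # term -> sorted doc IDs (the "small" index)
        self.deleted = set()    # stands in for the invalidation bit-vector
        # New doc IDs continue after the largest ID in the main index.
        self.next_id = 1 + max(
            (d for postings in self.main.values() for d in postings),
            default=-1,
        )

    def add(self, text):
        """Insert a new document; it only touches the auxiliary index."""
        doc_id = self.next_id
        self.next_id += 1
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)
        return doc_id

    def delete(self, doc_id):
        """Mark a document as invalid; postings stay where they are."""
        self.deleted.add(doc_id)

    def update(self, doc_id, text):
        """An update is a delete followed by a reinsert under a new ID."""
        self.delete(doc_id)
        return self.add(text)

    def search(self, term):
        """Search both indexes, merge, then filter out deleted docs."""
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in hits if d not in self.deleted]
```

Because new IDs are always larger than the IDs in the main index, concatenating the two postings lists already yields a sorted result in this sketch.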
[Figure: the main index and the auxiliary index each return a search result; the two results are merged]
Issues with main and auxiliary indexes 
• Frequent merges: every merge rewrites postings that are already in the main index
• Poor search performance during a merge
• Actually:
– Merging of the auxiliary index into the main index is efficient if we
keep a separate file for each postings list.
– The merge is then the same as a simple append.
– But then we would need a lot of files, which is inefficient for the OS.
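
The "merge as append" idea can be sketched as follows, assuming one text file per term with one doc ID per line. The function names and the `.postings` file naming are hypothetical, and the sketch assumes terms are valid file names.

```python
import os
import tempfile


def append_merge(index_dir, aux):
    """Merge an auxiliary index by appending to one file per postings list."""
    for term, postings in aux.items():
        # Hypothetical naming scheme: <term>.postings
        path = os.path.join(index_dir, term + ".postings")
        with open(path, "a") as f:
            f.write("".join(f"{doc_id}\n" for doc_id in postings))


def read_postings(index_dir, term):
    """Read the postings list of a term, or [] if the term is unknown."""
    path = os.path.join(index_dir, term + ".postings")
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [int(line) for line in f]


index_dir = tempfile.mkdtemp()
append_merge(index_dir, {"cat": [0, 1]})
append_merge(index_dir, {"cat": [2], "dog": [2]})
```

Each merge only appends, so existing postings are never rewritten, but the directory ends up with one file per term, which is the cost the slide points out.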
Logarithmic merge 
• Maintain a series of indexes, each twice as
large as the previous one
– At any time, only some of these powers of 2 are
instantiated
• Keep the smallest (Z0) in memory
• Larger ones (I0, I1, …) on disk
• If Z0 gets too big (> n), write it to disk as I0
– or merge it with I0 (if I0 already exists) to form Z1
• Either write the merged Z1 to disk as I1 (if there is no I1)
– or merge it with I1 to form Z2, and so on
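
The steps above can be sketched as a small Python class. This is an illustration under stated assumptions: the names `LogMergeIndex` and `merge` are hypothetical, a dict keyed by level stands in for the on-disk indexes I0, I1, …, and Z0 is flushed as soon as it holds n postings (the exact threshold is an assumption).

```python
def merge(a, b):
    """Merge two indexes (term -> sorted doc-ID list) into a new index."""
    out = {term: list(postings) for term, postings in a.items()}
    for term, postings in b.items():
        out[term] = sorted(set(out.get(term, [])) | set(postings))
    return out


class LogMergeIndex:
    """Logarithmic merging over indexes of size n, 2n, 4n, ... (sketch)."""

    def __init__(self, n):
        self.n = n        # capacity of the in-memory index Z0, in postings
        self.z0 = {}      # in-memory index Z0
        self.z0_size = 0
        self.disk = {}    # level i -> index I_i (a dict stands in for disk)

    def add_posting(self, term, doc_id):
        self.z0.setdefault(term, []).append(doc_id)
        self.z0_size += 1
        if self.z0_size >= self.n:
            self._flush()

    def _flush(self):
        z, i = self.z0, 0
        while i in self.disk:             # I_i exists: merge to form Z_{i+1}
            z = merge(self.disk.pop(i), z)
            i += 1
        self.disk[i] = z                  # first free level: write as I_i
        self.z0, self.z0_size = {}, 0

    def search(self, term):
        """A query has to consult Z0 and every instantiated I_i."""
        hits = list(self.z0.get(term, []))
        for index in self.disk.values():
            hits.extend(index.get(term, []))
        return sorted(hits)
```

With n = 2, adding eight postings leaves a single index at level 2, which mirrors how a binary counter carries: each flush behaves like adding 1 to a binary number whose set bits are the instantiated levels.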
Logarithmic merge 
• Auxiliary and main index: index construction
time is O(T^2), where T is the total number of postings,
as each posting is touched in each merge.
• Logarithmic merge: Each posting is merged 
O(log T) times, so complexity is O(T log T) 
• So logarithmic merge is much more efficient 
for index construction 
• But query processing now requires the 
merging of O(log T) indexes 
– Whereas it is O(1) if you just have a main and 
auxiliary index 
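
The gap between the two schemes can be checked with a small simulation that counts how many postings are written to disk. It is a rough cost model under stated assumptions, not a benchmark: both function names are hypothetical, the in-memory index holds n postings, T is a multiple of n, and each merge is charged the size of the index it produces.

```python
def aux_main_cost(T, n):
    """Postings written when every full auxiliary index is merged into main."""
    cost, main = 0, 0
    for _ in range(T // n):
        main += n
        cost += main          # the merge rewrites the whole main index
    return cost


def log_merge_cost(T, n):
    """Postings written under logarithmic merging."""
    cost, levels = 0, set()   # levels = instantiated indexes I_i
    for _ in range(T // n):
        size, i = n, 0
        if 0 not in levels:
            cost += n         # plain write of Z0 to disk as I0
        while i in levels:
            levels.remove(i)
            size *= 2         # Z_i is merged with the equally sized I_i
            cost += size
            i += 1
        levels.add(i)
    return cost
```

For T = 1024 and n = 1, this model gives 524800 written postings for the main-plus-auxiliary scheme and 10752 for logarithmic merging, in line with the quadratic versus T log T growth stated above.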
Further issues with multiple indexes 
• Collection-wide statistics are hard to 
maintain 
• E.g., spell-correction: which of several 
corrected alternatives do we present to the 
user? 
– pick the one with the most hits 
• How do we maintain the top ones with 
multiple indexes and invalidation bit vectors? 
– One possibility: ignore everything but the main 
index for such ordering 
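
A minimal sketch of the spell-correction example, assuming each index is a dict from term to doc-ID list and deletions are kept in a set. The helper names `hit_count` and `best_correction` and the sample data are hypothetical.

```python
def hit_count(term, indexes, deleted=frozenset()):
    """Number of live documents containing term across the given indexes."""
    docs = set()
    for index in indexes:
        docs.update(index.get(term, []))
    return len(docs - deleted)


def best_correction(candidates, indexes, deleted=frozenset()):
    """Pick the candidate correction with the most hits."""
    return max(candidates, key=lambda t: hit_count(t, indexes, deleted))


main = {"retrieval": [0, 1, 2], "retrieve": [3]}
aux = {"retrieve": [4, 5, 6]}
deleted = {1, 2}
candidates = ["retrieval", "retrieve"]

# Exact answer: consult every index and the invalidation set.
exact = best_correction(candidates, [main, aux], deleted)
# Approximation from the slide: look at the main index only.
approx = best_correction(candidates, [main])
```

In this made-up data the exact count prefers "retrieve" while the main-index-only shortcut prefers "retrieval", which shows the kind of error the approximation accepts in exchange for not touching every index.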
Dynamic indexing at search engines 
• All the large search engines now do dynamic 
indexing 
• Their indexes have frequent incremental 
changes 
– News items, blogs, new topical web pages 
• But (sometimes/typically) they also 
periodically reconstruct the index from 
scratch 
– Query processing is then switched to the new 
index, and the old index is deleted 