annotate roundup/backends/indexer_common.py @ 3544:5cd1c83dea50

Features and fixes.

Feature:
- trackers may configure custom stop-words for the full-text indexer

Fixed:
- fixes in scripts/import_sf.py
- fix some unicode bugs in roundup-admin import
- Xapian indexer wasn't actually being used
- fix indexing of message content on roundup-admin import
author Richard Jones <richard@users.sourceforge.net>
date Mon, 06 Feb 2006 21:00:47 +0000
parents a8c2371f45b6
children 5f4db2650da3
rev   line source
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3092
1 #$Id: indexer_common.py,v 1.5 2006-02-06 21:00:47 richard Exp $
2 import re, sets
3058
1c063814d567 Move search method duplicated in indexer_dbm and indexer_tsearch2...
Johannes Gijsbers <jlgijsbers@users.sourceforge.net>
parents:
3
4 from roundup import hyperdb
5
3544
6 STOPWORDS = [
7     "A", "AND", "ARE", "AS", "AT", "BE", "BUT", "BY",
8     "FOR", "IF", "IN", "INTO", "IS", "IT",
9     "NO", "NOT", "OF", "ON", "OR", "SUCH",
10     "THAT", "THE", "THEIR", "THEN", "THERE", "THESE",
11     "THEY", "THIS", "TO", "WAS", "WILL", "WITH"
3092
a8c2371f45b6 Some cleanup:
Johannes Gijsbers <jlgijsbers@users.sourceforge.net>
parents: 3088
12 ]
13
3058
14 def _isLink(propclass):
15     return (isinstance(propclass, hyperdb.Link) or
16             isinstance(propclass, hyperdb.Multilink))
17
18 class Indexer:
3544
19     def __init__(self, db):
20         self.stopwords = sets.Set(STOPWORDS)
21         for word in db.config[('main', 'indexer_stopwords')]:
22             self.stopwords.add(word)
23
24     def is_stopword(self, word):
25         return word in self.stopwords
26
3058
27     def getHits(self, search_terms, klass):
28         return self.find(search_terms)
29
30     def search(self, search_terms, klass, ignore={}):
31         '''Display search results looking for [search, terms] associated
32         with the hyperdb Class "klass". Ignore hits on {class: property}.
33
34         "dre" is a helper, not an argument.
35         '''
36         # do the index lookup
37         hits = self.getHits(search_terms, klass)
38         if not hits:
39             return {}
40
41         designator_propname = {}
42         for nm, propclass in klass.getprops().items():
43             if _isLink(propclass):
44                 designator_propname[propclass.classname] = nm
45
46         # build a dictionary of nodes and their associated messages
47         # and files
48         nodeids = {} # this is the answer
49         propspec = {} # used to do the klass.find
50         for propname in designator_propname.values():
51             propspec[propname] = {} # used as a set (value doesn't matter)
3076
2817a4db901d Change indexer_common.search() to take a list of nodeids...
Johannes Gijsbers <jlgijsbers@users.sourceforge.net>
parents: 3058
52         for classname, nodeid, property in hits:
3058
53             # skip this result if we don't care about this class/property
54             if ignore.has_key((classname, property)):
55                 continue
56
57             # if it's a property on klass, it's easy
58             if classname == klass.classname:
59                 if not nodeids.has_key(nodeid):
60                     nodeids[nodeid] = {}
61                 continue
62
63             # make sure the class is a linked one, otherwise ignore
64             if not designator_propname.has_key(classname):
65                 continue
66
67             # it's a linked class - set up to do the klass.find
68             linkprop = designator_propname[classname] # eg, msg -> messages
69             propspec[linkprop][nodeid] = 1
70
71         # retain only the meaningful entries
72         for propname, idset in propspec.items():
73             if not idset:
74                 del propspec[propname]
75
76         # klass.find tells me the klass nodeids the linked nodes relate to
77         for resid in klass.find(**propspec):
78             resid = str(resid)
79             if not nodeids.has_key(resid):
80                 nodeids[resid] = {}
81             node_dict = nodeids[resid]
82             # now figure out where it came from
83             for linkprop in propspec.keys():
84                 for nodeid in klass.get(resid, linkprop):
85                     if propspec[linkprop].has_key(nodeid):
86                         # OK, this node[propname] has a winner
87                         if not node_dict.has_key(linkprop):
88                             node_dict[linkprop] = [nodeid]
89                         else:
90                             node_dict[linkprop].append(nodeid)
91         return nodeids
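
The shape of the dictionary that search() returns is easier to see with a small standalone sketch. The names FakeIssueClass and resolve_hits below are hypothetical stand-ins, not Roundup APIs, and the code is written for current Python rather than the Python 2 of the listing. It assumes hits arrive as (classname, nodeid, property) tuples, as in the loop on line 52. Unlike the listing, it skips the find() call when there are no linked hits.

```python
# Illustrative stand-ins only: FakeIssueClass and resolve_hits are
# hypothetical names, not part of Roundup.

class FakeIssueClass:
    """Tiny substitute for a hyperdb class with a 'messages' link property."""
    classname = "issue"

    def __init__(self, nodes):
        self.nodes = nodes  # {issue id: [linked msg ids]}

    def find(self, **propspec):
        # ids of issues that link to any of the requested msg ids
        wanted = propspec.get("messages", {})
        return [nid for nid, msgs in sorted(self.nodes.items())
                if any(m in wanted for m in msgs)]

    def get(self, nodeid, propname):
        # only the 'messages' property exists in this stand-in
        return self.nodes[nodeid]


def resolve_hits(hits, klass, link_props, ignore=()):
    """Map raw index hits onto nodes of klass.

    link_props maps a linked class name to the link property on klass,
    for example {"msg": "messages"}.
    """
    nodeids = {}
    propspec = {prop: {} for prop in link_props.values()}
    for classname, nodeid, prop in hits:
        if (classname, prop) in ignore:
            continue
        if classname == klass.classname:
            # direct hit on a property of klass itself
            nodeids.setdefault(nodeid, {})
        elif classname in link_props:
            # hit on a linked node: remember it for the reverse lookup
            propspec[link_props[classname]][nodeid] = 1

    # drop link properties that collected no hits
    propspec = {p: ids for p, ids in propspec.items() if ids}
    if not propspec:
        return nodeids

    for resid in klass.find(**propspec):
        resid = str(resid)
        node_dict = nodeids.setdefault(resid, {})
        for linkprop, ids in propspec.items():
            for nodeid in klass.get(resid, linkprop):
                if nodeid in ids:
                    node_dict.setdefault(linkprop, []).append(nodeid)
    return nodeids


issues = FakeIssueClass({"1": ["10", "11"], "2": ["12"]})
hits = [
    ("issue", "2", "title"),    # direct hit on issue 2
    ("msg", "10", "content"),   # linked hits, both attached to issue 1
    ("msg", "11", "content"),
    ("file", "5", "content"),   # class not linked here, so it is ignored
]
result = resolve_hits(hits, issues, {"msg": "messages"})
```

Direct hits map to an empty dictionary, while hits that arrive through a link record which linked nodes matched, keyed by the link property name.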

Roundup Issue Tracker: http://roundup-tracker.org/