annotate roundup/backends/indexer_rdbms.py @ 3109:b2a5792b4e5c

updated German translation
author Richard Jones <richard@users.sourceforge.net>
date Thu, 13 Jan 2005 23:08:12 +0000
parents a8c2371f45b6
children a615cc230160
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
3092
a8c2371f45b6 Some cleanup:
Johannes Gijsbers <jlgijsbers@users.sourceforge.net>
parents: 3088
diff changeset
1 #$Id: indexer_rdbms.py,v 1.8 2005-01-08 16:16:59 jlgijsbers Exp $
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
2 ''' This implements the full-text indexer over two RDBMS tables. The first
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
3 is a mapping of words to occurance IDs. The second maps the IDs to (Class,
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
4 propname, itemid) instances.
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
5 '''
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
6 import re
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
7
3092
a8c2371f45b6 Some cleanup:
Johannes Gijsbers <jlgijsbers@users.sourceforge.net>
parents: 3088
diff changeset
8 from indexer_common import Indexer, is_stopword
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
9
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
10 class Indexer(Indexer):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
11 def __init__(self, db):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
12 self.db = db
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
13 self.reindex = 0
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
14
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
15 def close(self):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
16 '''close the indexing database'''
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
17 # just nuke the circular reference
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
18 self.db = None
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
19
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
20 def force_reindex(self):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
21 '''Force a reindexing of the database. This essentially
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
22 empties the tables ids and index and sets a flag so
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
23 that the databases are reindexed'''
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
24 self.reindex = 1
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
25
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
26 def should_reindex(self):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
27 '''returns True if the indexes need to be rebuilt'''
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
28 return self.reindex
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
29
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
30 def add_text(self, identifier, text, mime_type='text/plain'):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
31 ''' "identifier" is (classname, itemid, property) '''
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
32 if mime_type != 'text/plain':
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
33 return
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
34
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
35 # first, find the id of the (classname, itemid, property)
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
36 a = self.db.arg
2098
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
37 sql = 'select _textid from __textids where _class=%s and '\
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
38 '_itemid=%s and _prop=%s'%(a, a, a)
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
39 self.db.cursor.execute(sql, identifier)
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
40 r = self.db.cursor.fetchone()
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
41 if not r:
2098
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
42 id = self.db.newid('__textids')
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
43 sql = 'insert into __textids (_textid, _class, _itemid, _prop)'\
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
44 ' values (%s, %s, %s, %s)'%(a, a, a, a)
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
45 self.db.cursor.execute(sql, (id, ) + identifier)
2098
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
46 self.db.cursor.execute('select max(_textid) from __textids')
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
47 id = self.db.cursor.fetchone()[0]
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
48 else:
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
49 id = int(r[0])
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
50 # clear out any existing indexed values
2098
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
51 sql = 'delete from __words where _textid=%s'%a
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
52 self.db.cursor.execute(sql, (id, ))
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
53
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
54 # ok, find all the words in the text
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
55 wordlist = re.findall(r'\b\w{2,25}\b', str(text).upper())
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
56 words = {}
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
57 for word in wordlist:
2872
d530b68e4b42 don't index common words [SF#1046612]
Richard Jones <richard@users.sourceforge.net>
parents: 2098
diff changeset
58 if is_stopword(word):
d530b68e4b42 don't index common words [SF#1046612]
Richard Jones <richard@users.sourceforge.net>
parents: 2098
diff changeset
59 continue
d530b68e4b42 don't index common words [SF#1046612]
Richard Jones <richard@users.sourceforge.net>
parents: 2098
diff changeset
60 words[word] = 1
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
61 words = words.keys()
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
62
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
63 # for each word, add an entry in the db
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
64 for word in words:
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
65 # don't dupe
2098
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
66 sql = 'select * from __words where _word=%s and _textid=%s'%(a, a)
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
67 self.db.cursor.execute(sql, (word, id))
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
68 if self.db.cursor.fetchall():
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
69 continue
2098
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
70 sql = 'insert into __words (_word, _textid) values (%s, %s)'%(a, a)
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
71 self.db.cursor.execute(sql, (word, id))
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
72
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
73 def find(self, wordlist):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
74 '''look up all the words in the wordlist.
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
75 If none are found return an empty dictionary
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
76 * more rules here
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
77 '''
3033
f8d0fd056ac0 fix indexer searching with no valid words [SF#1086787]
Richard Jones <richard@users.sourceforge.net>
parents: 2872
diff changeset
78 if not wordlist:
f8d0fd056ac0 fix indexer searching with no valid words [SF#1086787]
Richard Jones <richard@users.sourceforge.net>
parents: 2872
diff changeset
79 return {}
f8d0fd056ac0 fix indexer searching with no valid words [SF#1086787]
Richard Jones <richard@users.sourceforge.net>
parents: 2872
diff changeset
80
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
81 l = [word.upper() for word in wordlist if 26 > len(word) > 2]
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
82
3048
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
83 if not l:
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
84 return {}
3048
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
85
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
86 if self.db.implements_intersect:
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
87 # simple AND search
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
88 sql = 'select distinct(_textid) from __words where _word=%s'%self.db.arg
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
89 sql = '\nINTERSECT\n'.join([sql]*len(l))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
90 self.db.cursor.execute(sql, tuple(l))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
91 r = self.db.cursor.fetchall()
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
92 if not r:
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
93 return {}
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
94 a = ','.join([self.db.arg] * len(r))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
95 sql = 'select _class, _itemid, _prop from __textids '\
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
96 'where _textid in (%s)'%a
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
97 self.db.cursor.execute(sql, tuple([int(id) for (id,) in r]))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
98
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
99 else:
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
100 # A more complex version for MySQL since it doesn't implement INTERSECT
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
101
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
102 # Construct SQL statement to join __words table to itself
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
103 # multiple times.
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
104 sql = """select distinct(__words1._textid)
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
105 from __words as __words1 %s
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
106 where __words1._word=%s %s"""
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
107
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
108 join_tmpl = ' left join __words as __words%d using (_textid) \n'
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
109 match_tmpl = ' and __words%d._word=%s \n'
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
110
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
111 join_list = []
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
112 match_list = []
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
113 for n in xrange(len(l) - 1):
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
114 join_list.append(join_tmpl % (n + 2))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
115 match_list.append(match_tmpl % (n + 2, self.db.arg))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
116
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
117 sql = sql%(' '.join(join_list), self.db.arg, ' '.join(match_list))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
118 self.db.cursor.execute(sql, l)
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
119
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
120 r = map(lambda x: x[0], self.db.cursor.fetchall())
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
121 if not r:
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
122 return {}
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
123
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
124 a = ','.join([self.db.arg] * len(r))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
125 sql = 'select _class, _itemid, _prop from __textids '\
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
126 'where _textid in (%s)'%a
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
127
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
128 self.db.cursor.execute(sql, tuple(map(int, r)))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
129
3076
2817a4db901d Change indexer_common.search() to take a list of nodeids...
Johannes Gijsbers <jlgijsbers@users.sourceforge.net>
parents: 3048
diff changeset
130 return self.db.cursor.fetchall()
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
131

Roundup Issue Tracker: http://roundup-tracker.org/