annotate roundup/backends/indexer_rdbms.py @ 3854:f4e8dc583256

Restored subject parser regexp to the string it was before the... ...implementation of customization of it, i.e., the version from CVS revision 1.184 of mailgw.py. This makes 'testFollowupTitleMatchMultiRe' work again.
author Erik Forsberg <forsberg@users.sourceforge.net>
date Sat, 12 May 2007 16:14:54 +0000
parents 0d561b24ceff
children 82e116d515d2
rev   line source
3718
0d561b24ceff support sqlite3
Richard Jones <richard@users.sourceforge.net>
parents: 3618
diff changeset
1 #$Id: indexer_rdbms.py,v 1.15 2006-10-04 01:12:00 richard Exp $
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
2 ''' This implements the full-text indexer over two RDBMS tables. The first
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
3 is a mapping of words to occurrence IDs. The second maps the IDs to (Class,
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
4 propname, itemid) instances.
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
5 '''
3617
f12722c7b9ee improvements
Richard Jones <richard@users.sourceforge.net>
parents: 3544
diff changeset
6 import re, sets
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
7
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3414
diff changeset
8 from roundup.backends.indexer_common import Indexer as IndexerBase
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
9
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3414
diff changeset
10 class Indexer(IndexerBase):
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
11 def __init__(self, db):
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3414
diff changeset
12 IndexerBase.__init__(self, db)
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
13 self.db = db
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
14 self.reindex = 0
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
15
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
16 def close(self):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
17 '''close the indexing database'''
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
18 # just nuke the circular reference
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
19 self.db = None
3295
a615cc230160 added Xapian indexer; replaces standard indexers if Xapian is available
Richard Jones <richard@users.sourceforge.net>
parents: 3092
diff changeset
20
a615cc230160 added Xapian indexer; replaces standard indexers if Xapian is available
Richard Jones <richard@users.sourceforge.net>
parents: 3092
diff changeset
21 def save_index(self):
a615cc230160 added Xapian indexer; replaces standard indexers if Xapian is available
Richard Jones <richard@users.sourceforge.net>
parents: 3092
diff changeset
22 '''Save the changes to the index.'''
a615cc230160 added Xapian indexer; replaces standard indexers if Xapian is available
Richard Jones <richard@users.sourceforge.net>
parents: 3092
diff changeset
23 # not necessary - the RDBMS connection will handle this for us
a615cc230160 added Xapian indexer; replaces standard indexers if Xapian is available
Richard Jones <richard@users.sourceforge.net>
parents: 3092
diff changeset
24 pass
3331
7bc09d5d9544 perform word splitting in unicode for national characters support
Alexander Smishlajev <a1s@users.sourceforge.net>
parents: 3295
diff changeset
25
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
26 def force_reindex(self):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
27 '''Force a reindexing of the database. This essentially
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
28 empties the tables ids and index and sets a flag so
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
29 that the databases are reindexed'''
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
30 self.reindex = 1
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
31
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
32 def should_reindex(self):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
33 '''returns True if the indexes need to be rebuilt'''
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
34 return self.reindex
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
35
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
36 def add_text(self, identifier, text, mime_type='text/plain'):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
37 ''' "identifier" is (classname, itemid, property) '''
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
38 if mime_type != 'text/plain':
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
39 return
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
40
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
41 # first, find the id of the (classname, itemid, property)
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
42 a = self.db.arg
2098
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
43 sql = 'select _textid from __textids where _class=%s and '\
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
44 '_itemid=%s and _prop=%s'%(a, a, a)
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
45 self.db.cursor.execute(sql, identifier)
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
46 r = self.db.cursor.fetchone()
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
47 if not r:
3617
f12722c7b9ee improvements
Richard Jones <richard@users.sourceforge.net>
parents: 3544
diff changeset
48 # not previously indexed
2098
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
49 id = self.db.newid('__textids')
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
50 sql = 'insert into __textids (_textid, _class, _itemid, _prop)'\
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
51 ' values (%s, %s, %s, %s)'%(a, a, a, a)
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
52 self.db.cursor.execute(sql, (id, ) + identifier)
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
53 else:
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
54 id = int(r[0])
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
55 # clear out any existing indexed values
2098
18addf2a8596 Implemented proper datatypes in mysql and postgresql backends...
Richard Jones <richard@users.sourceforge.net>
parents: 2093
diff changeset
56 sql = 'delete from __words where _textid=%s'%a
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
57 self.db.cursor.execute(sql, (id, ))
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
58
3617
f12722c7b9ee improvements
Richard Jones <richard@users.sourceforge.net>
parents: 3544
diff changeset
59 # ok, find all the unique words in the text
3331
7bc09d5d9544 perform word splitting in unicode for national characters support
Alexander Smishlajev <a1s@users.sourceforge.net>
parents: 3295
diff changeset
60 text = unicode(text, "utf-8", "replace").upper()
7bc09d5d9544 perform word splitting in unicode for national characters support
Alexander Smishlajev <a1s@users.sourceforge.net>
parents: 3295
diff changeset
61 wordlist = [w.encode("utf-8", "replace")
7bc09d5d9544 perform word splitting in unicode for national characters support
Alexander Smishlajev <a1s@users.sourceforge.net>
parents: 3295
diff changeset
62 for w in re.findall(r'(?u)\b\w{2,25}\b', text)]
3617
f12722c7b9ee improvements
Richard Jones <richard@users.sourceforge.net>
parents: 3544
diff changeset
63 words = sets.Set()
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
64 for word in wordlist:
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3414
diff changeset
65 if self.is_stopword(word): continue
3414
89a5c8e86346 merge from maint branch
Richard Jones <richard@users.sourceforge.net>
parents: 3331
diff changeset
66 if len(word) > 25: continue
3617
f12722c7b9ee improvements
Richard Jones <richard@users.sourceforge.net>
parents: 3544
diff changeset
67 words.add(word)
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
68
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
69 # for each word, add an entry in the db
3617
f12722c7b9ee improvements
Richard Jones <richard@users.sourceforge.net>
parents: 3544
diff changeset
70 sql = 'insert into __words (_word, _textid) values (%s, %s)'%(a, a)
f12722c7b9ee improvements
Richard Jones <richard@users.sourceforge.net>
parents: 3544
diff changeset
71 words = [(word, id) for word in words]
3618
b31a2e35be80 pysqlite 1.1.6 does not allow to pass a list of tuples to cursor.execute().
Alexander Smishlajev <a1s@users.sourceforge.net>
parents: 3617
diff changeset
72 self.db.cursor.executemany(sql, words)
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
73
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
74 def find(self, wordlist):
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
75 '''look up all the words in the wordlist.
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
76 If none are found return an empty dictionary
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
77 * more rules here
3331
7bc09d5d9544 perform word splitting in unicode for national characters support
Alexander Smishlajev <a1s@users.sourceforge.net>
parents: 3295
diff changeset
78 '''
3033
f8d0fd056ac0 fix indexer searching with no valid words [SF#1086787]
Richard Jones <richard@users.sourceforge.net>
parents: 2872
diff changeset
79 if not wordlist:
f8d0fd056ac0 fix indexer searching with no valid words [SF#1086787]
Richard Jones <richard@users.sourceforge.net>
parents: 2872
diff changeset
80 return {}
f8d0fd056ac0 fix indexer searching with no valid words [SF#1086787]
Richard Jones <richard@users.sourceforge.net>
parents: 2872
diff changeset
81
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
82 l = [word.upper() for word in wordlist if 26 > len(word) > 2]
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
83
3048
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
84 if not l:
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
85 return {}
3048
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
86
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
87 if self.db.implements_intersect:
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
88 # simple AND search
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
89 sql = 'select distinct(_textid) from __words where _word=%s'%self.db.arg
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
90 sql = '\nINTERSECT\n'.join([sql]*len(l))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
91 self.db.cursor.execute(sql, tuple(l))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
92 r = self.db.cursor.fetchall()
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
93 if not r:
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
94 return {}
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
95 a = ','.join([self.db.arg] * len(r))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
96 sql = 'select _class, _itemid, _prop from __textids '\
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
97 'where _textid in (%s)'%a
3718
0d561b24ceff support sqlite3
Richard Jones <richard@users.sourceforge.net>
parents: 3618
diff changeset
98 self.db.cursor.execute(sql, tuple([int(row[0]) for row in r]))
3048
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
99
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
100 else:
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
101 # A more complex version for MySQL since it doesn't implement INTERSECT
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
102
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
103 # Construct SQL statement to join __words table to itself
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
104 # multiple times.
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
105 sql = """select distinct(__words1._textid)
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
106 from __words as __words1 %s
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
107 where __words1._word=%s %s"""
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
108
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
109 join_tmpl = ' left join __words as __words%d using (_textid) \n'
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
110 match_tmpl = ' and __words%d._word=%s \n'
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
111
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
112 join_list = []
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
113 match_list = []
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
114 for n in xrange(len(l) - 1):
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
115 join_list.append(join_tmpl % (n + 2))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
116 match_list.append(match_tmpl % (n + 2, self.db.arg))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
117
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
118 sql = sql%(' '.join(join_list), self.db.arg, ' '.join(match_list))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
119 self.db.cursor.execute(sql, l)
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
120
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
121 r = map(lambda x: x[0], self.db.cursor.fetchall())
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
122 if not r:
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
123 return {}
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
124
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
125 a = ','.join([self.db.arg] * len(r))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
126 sql = 'select _class, _itemid, _prop from __textids '\
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
127 'where _textid in (%s)'%a
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
128
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
129 self.db.cursor.execute(sql, tuple(map(int, r)))
d9b4224f955c merge from maint-0-8
Richard Jones <richard@users.sourceforge.net>
parents: 3033
diff changeset
130
3076
2817a4db901d Change indexer_common.search() to take a list of nodeids...
Johannes Gijsbers <jlgijsbers@users.sourceforge.net>
parents: 3048
diff changeset
131 return self.db.cursor.fetchall()
2093
3f6024ab2c7a That's the last of the RDBMS migration steps done! Yay!
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
132
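
The two-table scheme in the listing above (a `__words` table mapping words to text IDs, and a `__textids` table mapping those IDs to (class, itemid, property)) can be tried out in isolation. What follows is a minimal sketch in Python 3 against an in-memory SQLite database, not the Roundup implementation. Several pieces are assumptions made for illustration: the `STOPWORDS` set stands in for `is_stopword()` (defined in `indexer_common`, which this page does not show), an autoincrementing primary key replaces `self.db.newid('__textids')`, the SQLite `?` placeholder replaces `self.db.arg`, and `find` returns an empty list rather than `{}` when nothing matches. Only the INTERSECT branch of `find` is sketched.

```python
import re
import sqlite3

# Hypothetical stand-in for is_stopword(); the real stopword list is not
# shown in the listing.
STOPWORDS = {"THE", "AND", "FOR"}

# Same pattern as the listing: runs of 2 to 25 word characters.
# In Python 3, str patterns are Unicode-aware by default.
WORD_RE = re.compile(r"\b\w{2,25}\b")


def split_words(text):
    """Upper-case the text and return the set of unique indexable words."""
    return {w for w in WORD_RE.findall(text.upper()) if w not in STOPWORDS}


def make_db():
    """Create the two index tables in an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.execute("create table __textids (_textid integer primary key, "
               "_class text, _itemid text, _prop text)")
    db.execute("create table __words (_word text, _textid integer)")
    return db


def add_text(db, identifier, text):
    """Index text under identifier, a (classname, itemid, property) tuple."""
    row = db.execute(
        "select _textid from __textids "
        "where _class=? and _itemid=? and _prop=?", identifier).fetchone()
    if row is None:
        # Not previously indexed: allocate a new text id.
        cur = db.execute(
            "insert into __textids (_class, _itemid, _prop) values (?, ?, ?)",
            identifier)
        textid = cur.lastrowid
    else:
        # Already indexed: clear out the old words before re-adding.
        textid = row[0]
        db.execute("delete from __words where _textid=?", (textid,))
    db.executemany("insert into __words (_word, _textid) values (?, ?)",
                   [(w, textid) for w in split_words(text)])


def find(db, wordlist):
    """Return the identifiers whose text contains every word (AND search)."""
    # Same length filter as the listing's find(): 3 to 25 characters.
    words = [w.upper() for w in wordlist if 26 > len(w) > 2]
    if not words:
        return []
    one = "select distinct(_textid) from __words where _word=?"
    sql = "\nINTERSECT\n".join([one] * len(words))
    ids = [r[0] for r in db.execute(sql, words).fetchall()]
    if not ids:
        return []
    marks = ",".join("?" * len(ids))
    return db.execute(
        "select _class, _itemid, _prop from __textids "
        "where _textid in (%s) order by _textid" % marks, ids).fetchall()
```

With two indexed titles, `find(db, ["indexer", "unicode"])` returns only the identifiers whose text contains both words, because each word contributes one `select` and the results are intersected. Note that, as in the listing, two-character words are stored by `add_text` but are filtered out by `find`.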

Roundup Issue Tracker: http://roundup-tracker.org/