annotate roundup/backends/indexer_dbm.py @ 3854:f4e8dc583256

Restored subject parser regexp to the string it was before the implementation of customization of it, i.e., the version from CVS revision 1.184 of mailgw.py. This makes 'testFollowupTitleMatchMultiRe' work again.
author Erik Forsberg <forsberg@users.sourceforge.net>
date Sat, 12 May 2007 16:14:54 +0000
parents 5f4db2650da3
children 2ff6f39aa391
rev   line source
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
1 #
2 # This module is derived from the module described at:
3 # http://gnosis.cx/publish/programming/charming_python_15.txt
4 #
5 # Author: David Mertz (mertz@gnosis.cx)
6 # Thanks to: Pat Knight (p.knight@ktgroup.co.uk)
7 # Gregory Popovitch (greg@gpy.com)
8 #
9 # The original module was released under this license, and remains under
10 # it:
11 #
12 # This file is released to the public domain. I (dqm) would
13 # appreciate it if you choose to keep derived works under terms
14 # that promote freedom, but obviously am giving up any rights
15 # to compel such.
16 #
3613
5f4db2650da3 implement close() on all indexers [SF#1242477]
Richard Jones <richard@users.sourceforge.net>
parents: 3555
diff changeset
17 #$Id: indexer_dbm.py,v 1.9 2006-04-27 05:48:26 richard Exp $
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
18 '''This module provides an indexer class, RoundupIndexer, that stores text
19 indices in a roundup instance. This class makes searching the content of
20 messages, string properties and text files possible.
21 '''
22 __docformat__ = 'restructuredtext'
23
24 import os, shutil, re, mimetypes, marshal, zlib, errno
25 from roundup.hyperdb import Link, Multilink
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3295
diff changeset
26 from roundup.backends.indexer_common import Indexer as IndexerBase
2872
d530b68e4b42 don't index common words [SF#1046612]
Richard Jones <richard@users.sourceforge.net>
parents: 2089
diff changeset
27
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3295
diff changeset
28 class Indexer(IndexerBase):
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
29     '''Indexes information from roundup's hyperdb to allow efficient
30     searching.
31
32     Three structures are created by the indexer::
33
34       files {identifier: (fileid, wordcount)}
35       words {word: {fileid: count}}
36       fileids {fileid: identifier}
37
38     where identifier is (classname, nodeid, propertyname)
39     '''
3295
a615cc230160 added Xapian indexer; replaces standard indexers if Xapian is available
Richard Jones <richard@users.sourceforge.net>
parents: 3092
diff changeset
40     def __init__(self, db):
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3295
diff changeset
41         IndexerBase.__init__(self, db)
3295
a615cc230160 added Xapian indexer; replaces standard indexers if Xapian is available
Richard Jones <richard@users.sourceforge.net>
parents: 3092
diff changeset
42         self.indexdb_path = os.path.join(db.config.DATABASE, 'indexes')
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
43         self.indexdb = os.path.join(self.indexdb_path, 'index.db')
44         self.reindex = 0
45         self.quiet = 9
46         self.changed = 0
47
48         # see if we need to reindex because of a change in code
49         version = os.path.join(self.indexdb_path, 'version')
50         if (not os.path.exists(self.indexdb_path) or
51                 not os.path.exists(version)):
52             # for now the file itself is a flag
53             self.force_reindex()
54         elif os.path.exists(version):
55             version = open(version).read()
56             # check the value and reindex if it's not the latest
57             if version.strip() != '1':
58                 self.force_reindex()
59
60     def force_reindex(self):
61         '''Force a reindex condition
62         '''
63         if os.path.exists(self.indexdb_path):
64             shutil.rmtree(self.indexdb_path)
65         os.makedirs(self.indexdb_path)
66         os.chmod(self.indexdb_path, 0775)
67         open(os.path.join(self.indexdb_path, 'version'), 'w').write('1\n')
68         self.reindex = 1
69         self.changed = 1
70
71     def should_reindex(self):
72         '''Should we reindex?
73         '''
74         return self.reindex
75
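The version-flag check in __init__ and force_reindex above can be illustrated in isolation. This is a minimal Python 3 sketch, not the module's own code: the helper names needs_reindex and reset_index and the INDEX_VERSION constant are hypothetical, and the demo runs against a throwaway temporary directory.

```python
import os
import shutil
import tempfile

INDEX_VERSION = '1'  # assumption: mirrors the '1' written by force_reindex()

def needs_reindex(index_dir):
    """Return True when the version flag file is missing or holds a stale value."""
    version_file = os.path.join(index_dir, 'version')
    if not os.path.exists(version_file):
        # covers both a missing directory and a missing flag file
        return True
    with open(version_file) as f:
        return f.read().strip() != INDEX_VERSION

def reset_index(index_dir):
    """Wipe the index directory and write a fresh version flag (hypothetical helper)."""
    if os.path.exists(index_dir):
        shutil.rmtree(index_dir)
    os.makedirs(index_dir)
    with open(os.path.join(index_dir, 'version'), 'w') as f:
        f.write(INDEX_VERSION + '\n')

base = tempfile.mkdtemp()
index_dir = os.path.join(base, 'indexes')
first_check = needs_reindex(index_dir)   # directory does not exist yet
reset_index(index_dir)
second_check = needs_reindex(index_dir)  # flag file now holds the current version
shutil.rmtree(base)
```

Bumping the version string in code would make every existing index look stale on the next start, which appears to be the purpose of the flag file.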
76     def add_text(self, identifier, text, mime_type='text/plain'):
77         '''Add some text associated with the (classname, nodeid, property)
78         identifier.
79         '''
80         # make sure the index is loaded
81         self.load_index()
82
83         # remove old entries for this identifier
84         if self.files.has_key(identifier):
85             self.purge_entry(identifier)
86
87         # split into words
88         words = self.splitter(text, mime_type)
89
90         # Find new file index, and assign it to identifier
91         # (_TOP uses trick of negative to avoid conflict with file index)
92         self.files['_TOP'] = (self.files['_TOP'][0]-1, None)
93         file_index = abs(self.files['_TOP'][0])
94         self.files[identifier] = (file_index, len(words))
95         self.fileids[file_index] = identifier
96
97         # find the unique words
98         filedict = {}
99         for word in words:
3544
5cd1c83dea50 Features and fixes.
Richard Jones <richard@users.sourceforge.net>
parents: 3295
diff changeset
100             if self.is_stopword(word):
2872
d530b68e4b42 don't index common words [SF#1046612]
Richard Jones <richard@users.sourceforge.net>
parents: 2089
diff changeset
101                 continue
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
102             if filedict.has_key(word):
103                 filedict[word] = filedict[word]+1
104             else:
105                 filedict[word] = 1
106
107         # now add to the totals
108         for word in filedict.keys():
109             # each word has a dict of {identifier: count}
110             if self.words.has_key(word):
111                 entry = self.words[word]
112             else:
113                 # new word
114                 entry = {}
115                 self.words[word] = entry
116
117             # make a reference to the file for this word
118             entry[file_index] = filedict[word]
119
120         # save needed
121         self.changed = 1
122
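To see how add_text fills the three structures named in the class docstring (files, words, fileids), here is a small standalone Python 3 sketch. It works on plain dictionaries rather than an Indexer instance, takes an already split word list, and leaves out the purge and stop-word steps. The function signature and the sample identifiers are assumptions made for illustration.

```python
def add_text(index, identifier, words):
    """Record a list of words for identifier in the three index dictionaries."""
    files, fileids, all_words = index['files'], index['fileids'], index['words']

    # _TOP counts downwards; its absolute value is the newest file id
    files['_TOP'] = (files['_TOP'][0] - 1, None)
    file_index = abs(files['_TOP'][0])
    files[identifier] = (file_index, len(words))
    fileids[file_index] = identifier

    # count occurrences of each word within this one text
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1

    # merge into the global postings: {word: {fileid: count}}
    for word, count in counts.items():
        all_words.setdefault(word, {})[file_index] = count
    return file_index

index = {'files': {'_TOP': (0, None)}, 'fileids': {}, 'words': {}}
add_text(index, ('msg', '1', 'content'), ['HELLO', 'WORLD', 'HELLO'])
add_text(index, ('msg', '2', 'content'), ['WORLD'])
```

After the two calls, index['words']['WORLD'] maps both file ids to a count of 1, while 'HELLO' is recorded only for file id 1 with a count of 2.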
123     def splitter(self, text, ftype):
124         '''Split the contents of a text string into a list of 'words'
125         '''
126         if ftype == 'text/plain':
127             words = self.text_splitter(text)
128         else:
129             return []
130         return words
131
132     def text_splitter(self, text):
133         """Split text/plain string into a list of words
134         """
135         # case insensitive
136         text = str(text).upper()
137
138         # Split the raw text, losing anything longer than 25 characters
139         # since that'll be gibberish (encoded text or somesuch) or shorter
140         # than 3 characters since those short words appear all over the
141         # place
142         return re.findall(r'\b\w{2,25}\b', text)
143
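The effect of the text_splitter pattern is easiest to see on a sample string. The sketch below reuses the same regular expression in a standalone Python 3 function; the sample inputs are made up for illustration. Note that the pattern as written admits two-character tokens, even though the surrounding comment speaks of dropping words shorter than 3 characters.

```python
import re

def text_splitter(text):
    """Upper-case the text and return runs of 2 to 25 word characters."""
    return re.findall(r'\b\w{2,25}\b', str(text).upper())

# single-character words are dropped, two-character words are kept
tokens = text_splitter("A bug in py: reindexing fails")

# a run of 30 word characters has no word boundary inside it, so nothing matches
long_run = text_splitter("x" * 30)
```

Because both ends of the pattern are anchored with \b, an over-long run is discarded as a whole rather than being cut into 25-character pieces.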
144     # we override this to ignore not 2 < word < 25 and also to fix a bug -
145     # the (fail) case.
146     def find(self, wordlist):
147         '''Locate files that match ALL the words in wordlist
148         '''
149         if not hasattr(self, 'words'):
150             self.load_index()
151         self.load_index(wordlist=wordlist)
152         entries = {}
153         hits = None
154         for word in wordlist:
155             if not 2 < len(word) < 25:
156                 # word outside the bounds of what we index - ignore
157                 continue
158             word = word.upper()
159             entry = self.words.get(word) # For each word, get index
160             entries[word] = entry # of matching files
161             if not entry: # Nothing for this one word (fail)
162                 return {}
163             if hits is None:
164                 hits = {}
165                 for k in entry.keys():
166                     if not self.fileids.has_key(k):
167                         raise ValueError, 'Index is corrupted: re-generate it'
168                     hits[k] = self.fileids[k]
169             else:
170                 # Eliminate hits for every non-match
171                 for fileid in hits.keys():
172                     if not entry.has_key(fileid):
173                         del hits[fileid]
174         if hits is None:
175             return {}
3076
2817a4db901d Change indexer_common.search() to take a list of nodeids...
Johannes Gijsbers <jlgijsbers@users.sourceforge.net>
parents: 3058
diff changeset
176         return hits.values()
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
177
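The matching rule in find is an intersection of per-word postings: the first usable word seeds the candidate set and every later word can only remove candidates. The Python 3 sketch below shows that logic on plain dictionaries. The function signature and the sample data are assumptions, and unlike the method above it always returns a list.

```python
def find(words_index, fileids, wordlist):
    """Return identifiers of the entries that contain ALL words in wordlist."""
    hits = None
    for word in wordlist:
        if not 2 < len(word) < 25:
            continue  # outside the length bounds checked by find, so ignored
        entry = words_index.get(word.upper())
        if not entry:
            return []  # one word has no postings, so nothing can match all words
        if hits is None:
            # the first usable word seeds the candidate set
            for fid in entry:
                if fid not in fileids:
                    raise ValueError('Index is corrupted: re-generate it')
            hits = {fid: fileids[fid] for fid in entry}
        else:
            # later words can only remove candidates
            hits = {fid: ident for fid, ident in hits.items() if fid in entry}
    return list(hits.values()) if hits else []

words_index = {'HELLO': {1: 2}, 'WORLD': {1: 1, 2: 1}}
fileids = {1: ('msg', '1', 'content'), 2: ('msg', '2', 'content')}

both = find(words_index, fileids, ['hello', 'world'])
short_skipped = find(words_index, fileids, ['to', 'hello'])
missing = find(words_index, fileids, ['nothing'])
```

A query word that falls outside the length bounds is skipped rather than treated as a failed match, so 'to' does not prevent 'hello' from matching.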
178     segments = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_-!"
179     def load_index(self, reload=0, wordlist=None):
180         # Unless reload is indicated, do not load twice
181         if self.index_loaded() and not reload:
182             return 0
183
184         # Ok, now let's actually load it
185         db = {'WORDS': {}, 'FILES': {'_TOP':(0,None)}, 'FILEIDS': {}}
186
187         # Identify the relevant word-dictionary segments
188         if not wordlist:
189             segments = self.segments
190         else:
191             segments = ['-','#']
192             for word in wordlist:
193                 segments.append(word[0].upper())
194
195         # Load the segments
196         for segment in segments:
197             try:
198                 f = open(self.indexdb + segment, 'rb')
199             except IOError, error:
200                 # probably just nonexistent segment index file
201                 if error.errno != errno.ENOENT: raise
202             else:
203                 pickle_str = zlib.decompress(f.read())
204                 f.close()
205                 dbslice = marshal.loads(pickle_str)
206                 if dbslice.get('WORDS'):
207                     # if it has some words, add them
208                     for word, entry in dbslice['WORDS'].items():
209                         db['WORDS'][word] = entry
210                 if dbslice.get('FILES'):
211                     # if it has some files, add them
212                     db['FILES'] = dbslice['FILES']
213                 if dbslice.get('FILEIDS'):
214                     # if it has fileids, add them
215                     db['FILEIDS'] = dbslice['FILEIDS']
216
217         self.words = db['WORDS']
218         self.files = db['FILES']
219         self.fileids = db['FILEIDS']
220         self.changed = 0
221
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
222 def save_index(self):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
223 # only save if the index is loaded and changed
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
224 if not self.index_loaded() or not self.changed:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
225 return
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
226
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
227 # brutal space saver... delete all the small segments
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
228 for segment in self.segments:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
229 try:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
230 os.remove(self.indexdb + segment)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
231 except OSError, error:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
232 # probably just nonexistent segment index file
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
233 if error.errno != errno.ENOENT: raise
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
234
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
235 # First write the much simpler filename/fileid dictionaries
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
236 dbfil = {'WORDS':None, 'FILES':self.files, 'FILEIDS':self.fileids}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
237 open(self.indexdb+'-','wb').write(zlib.compress(marshal.dumps(dbfil)))
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
238
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
239 # The hard part is splitting the word dictionary up, of course
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
240 letters = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_"
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
241 segdicts = {} # Need batch of empty dicts
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
242 for segment in letters:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
243 segdicts[segment] = {}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
244 for word, entry in self.words.items(): # Split into segment dicts
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
245 initchar = word[0].upper()
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
246 segdicts[initchar][word] = entry
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
247
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
248 # save
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
249 for initchar in letters:
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
250 db = {'WORDS':segdicts[initchar], 'FILES':None, 'FILEIDS':None}
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
251 pickle_str = marshal.dumps(db)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
252 filename = self.indexdb + initchar
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
253 pickle_fh = open(filename, 'wb')
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
254 pickle_fh.write(zlib.compress(pickle_str))
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
255 os.chmod(filename, 0664)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
256
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
257 # save done
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
258 self.changed = 0
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
259
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
260 def purge_entry(self, identifier):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
261 '''Remove a file from file index and word index
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
262 '''
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
263 self.load_index()
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
264
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
265 if not self.files.has_key(identifier):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
266 return
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
267
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
268 file_index = self.files[identifier][0]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
269 del self.files[identifier]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
270 del self.fileids[file_index]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
271
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
272 # The much harder part: clean up the word index
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
273 for key, occurs in self.words.items():
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
274 if occurs.has_key(file_index):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
275 del occurs[file_index]
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
276
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
277 # save needed
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
278 self.changed = 1
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
279
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
280 def index_loaded(self):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
281 return (hasattr(self,'fileids') and hasattr(self,'files') and
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
282 hasattr(self,'words'))
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
283
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
284 def rollback(self):
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
285 ''' load last saved index info. '''
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
286 self.load_index(reload=1)
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
287
3613
5f4db2650da3 implement close() on all indexers [SF#1242477]
Richard Jones <richard@users.sourceforge.net>
parents: 3555
diff changeset
288 def close(self):
5f4db2650da3 implement close() on all indexers [SF#1242477]
Richard Jones <richard@users.sourceforge.net>
parents: 3555
diff changeset
289 pass
5f4db2650da3 implement close() on all indexers [SF#1242477]
Richard Jones <richard@users.sourceforge.net>
parents: 3555
diff changeset
290
5f4db2650da3 implement close() on all indexers [SF#1242477]
Richard Jones <richard@users.sourceforge.net>
parents: 3555
diff changeset
291
2089
93f03c6714d8 A few big changes in this commit:
Richard Jones <richard@users.sourceforge.net>
parents:
diff changeset
292 # vim: set filetype=python ts=4 sw=4 et si
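The segment scheme used by save_index and load_index above (one zlib-compressed marshal file per initial character) can be sketched in Python 3 as follows. This is a minimal illustration, not the module's code: the names `save_segments`, `load_segments`, and `LETTERS` are hypothetical, only the word dictionary is handled, and filing words with an unexpected initial character under `_` is an assumption made here to avoid a `KeyError`.

```python
import marshal
import os
import tempfile
import zlib

# One segment file per initial character, mirroring the string in save_index.
LETTERS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_"


def save_segments(prefix, words):
    """Split `words` by initial character and write one file per segment."""
    segdicts = {letter: {} for letter in LETTERS}
    for word, entry in words.items():
        # Assumes every word is a non-empty string.
        initchar = word[0].upper()
        # Assumption: unknown initials fall back to the '_' segment.
        segdicts.get(initchar, segdicts["_"])[word] = entry
    for letter, segment in segdicts.items():
        with open(prefix + letter, "wb") as fh:
            fh.write(zlib.compress(marshal.dumps({"WORDS": segment})))


def load_segments(prefix):
    """Merge every segment file that exists back into one dictionary."""
    words = {}
    for letter in LETTERS:
        try:
            with open(prefix + letter, "rb") as fh:
                dbslice = marshal.loads(zlib.decompress(fh.read()))
        except FileNotFoundError:
            # A missing segment file is treated as an empty segment.
            continue
        words.update(dbslice.get("WORDS") or {})
    return words
```

Calling `save_segments(prefix, words)` followed by `load_segments(prefix)` should return a dictionary equal to `words`, provided the entries contain only types that `marshal` supports.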

Roundup Issue Tracker: http://roundup-tracker.org/