annotate roundup/indexer.py @ 834:568eed5fb4fd

Optimize Class.find so that the propspec can contain a set of ids to match. This is used by indexer.search so it can do just one find for all the index matches. This was already confusing code, but for common terms (lots of index matches), it is enormously faster.
author Gordon B. McMillan <gmcm@users.sourceforge.net>
date Tue, 09 Jul 2002 21:53:38 +0000
parents b80aaedba3db
children ba38e1e718f2
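
The optimization described in the changeset summary can be illustrated with a small, self-contained sketch. ToyClass and its sample data are hypothetical stand-ins, not roundup's hyperdb Class. The sketch only assumes what the summary states: a propspec value may be a set of ids (a dict used as a set, as in the annotated code), so a single find call can cover every index hit. It is written in modern Python rather than the Python 2 of the listing.

```python
class ToyClass:
    """Hypothetical stand-in for a hyperdb Class (not roundup's code)."""

    def __init__(self, nodes):
        # nodes maps nodeid -> {propname: [linked ids]}
        self.nodes = nodes

    def find(self, **propspec):
        """Return ids of nodes linked to any of the requested ids.

        Each propspec value is either a single id or a dict whose keys
        are the ids to match (the dict is used as a set).
        """
        matches = []
        for nodeid, props in self.nodes.items():
            for propname, wanted in propspec.items():
                if not isinstance(wanted, dict):
                    wanted = {wanted: 1}  # normalise a single id to a set
                if any(linked in wanted for linked in props.get(propname, [])):
                    matches.append(nodeid)
                    break  # one matching property is enough for this node
        return matches


issues = ToyClass({
    '1': {'messages': ['10', '11'], 'files': []},
    '2': {'messages': ['12'], 'files': ['20']},
    '3': {'messages': [], 'files': ['21']},
})

# One call covers all index hits, instead of one find per hit.
result = issues.find(messages={'11': 1, '12': 1}, files={'21': 1})
```

With many index hits, the saving comes from scanning the class once rather than once per hit; the caller then works out which linked node produced each match, as the search method in the listing does.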
rev   line source
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
1 #
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
2 # This module is derived from the module described at:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
3 # http://gnosis.cx/publish/programming/charming_python_15.txt
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
4 #
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
5 # Author: David Mertz (mertz@gnosis.cx)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
6 # Thanks to: Pat Knight (p.knight@ktgroup.co.uk)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
7 # Gregory Popovitch (greg@gpy.com)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
8 #
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
9 # The original module was released under this license, and remains under
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
10 # it:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
11 #
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
12 # This file is released to the public domain. I (dqm) would
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
13 # appreciate it if you choose to keep derived works under terms
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
14 # that promote freedom, but obviously am giving up any rights
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
15 # to compel such.
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
16 #
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
17 #$Id: indexer.py,v 1.8 2002-07-09 21:53:38 gmcm Exp $
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
18 '''
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
19 This module provides an indexer class, RoundupIndexer, that stores text
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
20 indices in a roundup instance. This class makes searching the content of
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
21 messages, string properties and text files possible.
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
22 '''
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
23 import os, shutil, re, mimetypes, marshal, zlib, errno
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
24 from hyperdb import Link, Multilink
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
25
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
26 class Indexer:
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
27 ''' Indexes information from roundup's hyperdb to allow efficient
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
28 searching.
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
29
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
30 Three structures are created by the indexer:
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
31 files {identifier: (fileid, wordcount)}
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
32 words {word: {fileid: count}}
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
33 fileids {fileid: identifier}
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
34 where identifier is (classname, nodeid, propertyname)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
35 '''
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
36 def __init__(self, db_path):
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
37 self.indexdb_path = os.path.join(db_path, 'indexes')
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
38 self.indexdb = os.path.join(self.indexdb_path, 'index.db')
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
39 self.reindex = 0
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
40 self.quiet = 9
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
41 self.changed = 0
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
42
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
43 # see if we need to reindex because of a change in code
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
44 if (not os.path.exists(self.indexdb_path) or
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
45 not os.path.exists(os.path.join(self.indexdb_path, 'version'))):
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
46 # TODO: if the version file exists (in the future) we'll want to
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
47 # check the value in it - for now the file itself is a flag
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
48 self.force_reindex()
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
49
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
50 def force_reindex(self):
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
51 '''Force a reindex condition
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
52 '''
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
53 if os.path.exists(self.indexdb_path):
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
54 shutil.rmtree(self.indexdb_path)
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
55 os.makedirs(self.indexdb_path)
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
56 os.chmod(self.indexdb_path, 0775)
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
57 open(os.path.join(self.indexdb_path, 'version'), 'w').write('1\n')
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
58 self.reindex = 1
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
59 self.changed = 1
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
60
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
61 def should_reindex(self):
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
62 '''Should we reindex?
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
63 '''
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
64 return self.reindex
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
65
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
66 def add_text(self, identifier, text, mime_type='text/plain'):
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
67 ''' Add some text associated with the (classname, nodeid, property)
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
68 identifier.
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
69 '''
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
70 # make sure the index is loaded
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
71 self.load_index()
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
72
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
73 # remove old entries for this identifier
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
74 if self.files.has_key(identifier):
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
75 self.purge_entry(identifier)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
76
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
77 # split into words
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
78 words = self.splitter(text, mime_type)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
79
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
80 # Find new file index, and assign it to identifier
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
81 # (_TOP uses trick of negative to avoid conflict with file index)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
82 self.files['_TOP'] = (self.files['_TOP'][0]-1, None)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
83 file_index = abs(self.files['_TOP'][0])
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
84 self.files[identifier] = (file_index, len(words))
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
85 self.fileids[file_index] = identifier
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
86
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
87 # find the unique words
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
88 filedict = {}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
89 for word in words:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
90 if filedict.has_key(word):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
91 filedict[word] = filedict[word]+1
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
92 else:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
93 filedict[word] = 1
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
94
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
95 # now add to the totals
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
96 for word in filedict.keys():
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
97 # each word has a dict of {fileid: count}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
98 if self.words.has_key(word):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
99 entry = self.words[word]
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
100 else:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
101 # new word
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
102 entry = {}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
103 self.words[word] = entry
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
104
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
105 # make a reference to the file for this word
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
106 entry[file_index] = filedict[word]
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
107
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
108 # save needed
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
109 self.changed = 1
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
110
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
111 def splitter(self, text, ftype):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
112 ''' Split the contents of a text string into a list of 'words'
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
113 '''
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
114 if ftype == 'text/plain':
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
115 words = self.text_splitter(text)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
116 else:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
117 return []
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
118 return words
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
119
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
120 def text_splitter(self, text):
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
121 """Split text/plain string into a list of words
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
122 """
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
123 # case insensitive
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
124 text = text.upper()
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
125
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
126 # Split the raw text, losing anything longer than 25 characters
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
127 # since that'll be gibberish (encoded text or somesuch) or shorter
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
128 # than 3 characters since those short words appear all over the
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
129 # place
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
130 return re.findall(r'\b\w{3,25}\b', text)
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
131
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
132 def search(self, search_terms, klass, ignore={},
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
133 dre=re.compile(r'([^\d]+)(\d+)')):
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
134 ''' Display search results looking for [search, terms] associated
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
135 with the hyperdb Class "klass". Ignore hits whose (classname, property) pair is a key in "ignore".
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
136
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
137 "dre" is a helper, not an argument.
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
138 '''
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
139 # do the index lookup
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
140 hits = self.find(search_terms)
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
141 if not hits:
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
142 return {}
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
143
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
144 #designator_propname = {'msg': 'messages', 'file': 'files'}
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
145 designator_propname = {}
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
146 for nm, propclass in klass.getprops().items():
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
147 if isinstance(propclass, Link) or isinstance(propclass, Multilink):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
148 designator_propname[propclass.classname] = nm
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
149
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
150 # build a dictionary of nodes and their associated messages
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
151 # and files
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
152 nodeids = {} # this is the answer
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
153 propspec = {} # used to do the klass.find
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
154 for propname in designator_propname.values():
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
155 propspec[propname] = {} # used as a set (value doesn't matter)
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
156 for classname, nodeid, property in hits.values():
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
157 # skip this result if we don't care about this class/property
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
158 if ignore.has_key((classname, property)):
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
159 continue
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
160
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
161 # if it's a property on klass, it's easy
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
162 if classname == klass.classname:
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
163 if not nodeids.has_key(nodeid):
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
164 nodeids[nodeid] = {}
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
165 continue
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
166
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
167 # it's a linked class - set up to do the klass.find
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
168 linkprop = designator_propname[classname] # eg, msg -> messages
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
169 propspec[linkprop][nodeid] = 1
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
170
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
171 # retain only the meaningful entries
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
172 for propname, idset in propspec.items():
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
173 if not idset:
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
174 del propspec[propname]
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
175
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
176 # klass.find tells me the klass nodeids the linked nodes relate to
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
177 for resid in klass.find(**propspec):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
178 resid = str(resid)
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
179 if not nodeids.has_key(resid):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
180 nodeids[resid] = {}
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
181 node_dict = nodeids[resid]
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
182 # now figure out where it came from
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
183 for linkprop in propspec.keys():
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
184 for nodeid in klass.get(resid, linkprop):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
185 if propspec[linkprop].has_key(nodeid):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
186 # OK, this node[propname] has a winner
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
187 if not node_dict.has_key(linkprop):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
188 node_dict[linkprop] = [nodeid]
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
189 else:
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
190 node_dict[linkprop].append(nodeid)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
191 return nodeids
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
192
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
193 # we override this to ignore words whose length is not 2 < len(word) < 25, and also to fix a bug -
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
194 # the (fail) case.
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
195 def find(self, wordlist):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
196 ''' Locate files that match ALL the words in wordlist
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
197 '''
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
198 if not hasattr(self, 'words'):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
199 self.load_index()
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
200 self.load_index(wordlist=wordlist)
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
201 entries = {}
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
202 hits = None
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
203 for word in wordlist:
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
204 if not 2 < len(word) < 25:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
205 # word outside the bounds of what we index - ignore
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
206 continue
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
207 word = word.upper()
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
208 entry = self.words.get(word) # For each word, get index
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
209 entries[word] = entry # of matching files
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
210 if not entry: # Nothing for this one word (fail)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
211 return {}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
212 if hits is None:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
213 hits = {}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
214 for k in entry.keys():
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
215 hits[k] = self.fileids[k]
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
216 else:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
217 # Eliminate hits for every non-match
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
218 for fileid in hits.keys():
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
219 if not entry.has_key(fileid):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
220 del hits[fileid]
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
221 if hits is None:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
222 return {}
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
223 return hits
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
224
827
0a2c1f5e0e5a We're indexing numbers now, and _underscore words
Richard Jones <richard@users.sourceforge.net>
parents: 826
diff changeset
225 segments = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_-!"
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
226 def load_index(self, reload=0, wordlist=None):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
227 # Unless reload is indicated, do not load twice
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
228 if self.index_loaded() and not reload:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
229 return 0
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
230
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
231 # Ok, now let's actually load it
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
232 db = {'WORDS': {}, 'FILES': {'_TOP':(0,None)}, 'FILEIDS': {}}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
233
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
234 # Identify the relevant word-dictionary segments
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
235 if not wordlist:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
236 segments = self.segments
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
237 else:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
238 segments = ['-','#']
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
239 for word in wordlist:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
240 segments.append(word[0].upper())
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
241
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
242 # Load the segments
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
243 for segment in segments:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
244 try:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
245 f = open(self.indexdb + segment, 'rb')
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
246 except IOError, error:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
247 if error.errno != errno.ENOENT:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
248 raise
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
249 else:
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
250 pickle_str = zlib.decompress(f.read())
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
251 f.close()
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
252 dbslice = marshal.loads(pickle_str)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
253 if dbslice.get('WORDS'):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
254 # if it has some words, add them
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
255 for word, entry in dbslice['WORDS'].items():
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
256 db['WORDS'][word] = entry
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
257 if dbslice.get('FILES'):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
258 # if it has some files, add them
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
259 db['FILES'] = dbslice['FILES']
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
260 if dbslice.get('FILEIDS'):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
261 # if it has fileids, add them
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
262 db['FILEIDS'] = dbslice['FILEIDS']
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
263
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
264 self.words = db['WORDS']
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
265 self.files = db['FILES']
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
266 self.fileids = db['FILEIDS']
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
267 self.changed = 0
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
268
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
269 def save_index(self):
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
270 # only save if the index is loaded and changed
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
271 if not self.index_loaded() or not self.changed:
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
272 return
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
273
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
274 # brutal space saver... delete all the small segments
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
275 for segment in self.segments:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
276 try:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
277 os.remove(self.indexdb + segment)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
278 except OSError:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
279 # probably just nonexistent segment index file
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
280 # TODO: make sure it's an ENOENT
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
281 pass
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
282
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
283 # First write the much simpler filename/fileid dictionaries
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
284 dbfil = {'WORDS':None, 'FILES':self.files, 'FILEIDS':self.fileids}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
285 open(self.indexdb+'-','wb').write(zlib.compress(marshal.dumps(dbfil)))
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
286
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
287 # The hard part is splitting the word dictionary up, of course
827
0a2c1f5e0e5a We're indexing numbers now, and _underscore words
Richard Jones <richard@users.sourceforge.net>
parents: 826
diff changeset
288 letters = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_"
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
289 segdicts = {} # Need batch of empty dicts
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
290 for segment in letters:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
291 segdicts[segment] = {}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
292 for word, entry in self.words.items(): # Split into segment dicts
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
293 initchar = word[0].upper()
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
294 segdicts[initchar][word] = entry
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
295
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
296 # save
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
297 for initchar in letters:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
298 db = {'WORDS':segdicts[initchar], 'FILES':None, 'FILEIDS':None}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
299 pickle_str = marshal.dumps(db)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
300 filename = self.indexdb + initchar
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
301 pickle_fh = open(filename, 'wb')
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
302 pickle_fh.write(zlib.compress(pickle_str))
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
303 os.chmod(filename, 0664)
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
304
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
305 # save done
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
306 self.changed = 0
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
307
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
308 def purge_entry(self, identifier):
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
309 ''' Remove a file from file index and word index
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
310 '''
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
311 if not self.files.has_key(identifier):
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
312 return
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
313
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
314 file_index = self.files[identifier][0]
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
315 del self.files[identifier]
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
316 del self.fileids[file_index]
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
317
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
318 # The much harder part, cleanup the word index
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
319 for key, occurs in self.words.items():
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
320 if occurs.has_key(file_index):
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
321 del occurs[file_index]
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
322
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
323 # save needed
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
324 self.changed = 1
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
325
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
326 def index_loaded(self):
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
327 return (hasattr(self,'fileids') and hasattr(self,'files') and
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
328 hasattr(self,'words'))
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
329
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
330 #
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
331 #$Log: not supported by cvs2svn $
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
332 #Revision 1.7 2002/07/09 21:38:43 richard
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
333 #Only save the index if the thing is loaded and changed. Also, don't load
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
334 #the index just for a save.
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
335 #
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
336 #Revision 1.6 2002/07/09 04:26:44 richard
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
337 #We're indexing numbers now, and _underscore words
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
338 #
827
0a2c1f5e0e5a We're indexing numbers now, and _underscore words
Richard Jones <richard@users.sourceforge.net>
parents: 826
diff changeset
339 #Revision 1.5 2002/07/09 04:19:09 richard
0a2c1f5e0e5a We're indexing numbers now, and _underscore words
Richard Jones <richard@users.sourceforge.net>
parents: 826
diff changeset
340 #Added reindex command to roundup-admin.
0a2c1f5e0e5a We're indexing numbers now, and _underscore words
Richard Jones <richard@users.sourceforge.net>
parents: 826
diff changeset
341 #Fixed reindex on first access.
0a2c1f5e0e5a We're indexing numbers now, and _underscore words
Richard Jones <richard@users.sourceforge.net>
parents: 826
diff changeset
342 #Also fixed reindexing of entries that change.
0a2c1f5e0e5a We're indexing numbers now, and _underscore words
Richard Jones <richard@users.sourceforge.net>
parents: 826
diff changeset
343 #
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
344 #Revision 1.4 2002/07/09 03:02:52 richard
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
345 #More indexer work:
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
346 #- all String properties may now be indexed too. Currently there's a bit of
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
347 # "issue" specific code in the actual searching which needs to be
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
348 # addressed. In a nutshell:
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
349 # + pass 'indexme="yes"' as a String() property initialisation arg, eg:
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
350 # file = FileClass(db, "file", name=String(), type=String(),
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
351 # comment=String(indexme="yes"))
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
352 # + the comment will then be indexed and be searchable, with the results
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
353 # related back to the issue that the file is linked to
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
354 #- as a result of this work, the FileClass has a default MIME type that may
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
355 # be overridden in a subclass, or by the use of a "type" property as is
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
356 # done in the default templates.
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
357 #- the regeneration of the indexes (if necessary) is done once the schema is
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
358 # set up in the dbinit.
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
359 #
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
360 #Revision 1.3 2002/07/08 06:58:15 richard
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
361 #cleaned up the indexer code:
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
362 # - it splits more words out (much simpler, faster splitter)
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
363 # - removed code we'll never use (roundup.roundup_indexer has the full
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
364 # implementation, and replaces roundup.indexer)
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
365 # - only index text/plain and rfc822/message (ideas for other text formats to
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
366 # index are welcome)
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
367 # - added simple unit test for indexer. Needs more tests for regression.
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
368 #
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
369 #Revision 1.2 2002/05/25 07:16:24 rochecompaan
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
370 #Merged search_indexing-branch with HEAD
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
371 #
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
372 #Revision 1.1.2.3 2002/05/02 11:52:12 rochecompaan
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
373 #Fixed small bug that prevented indexes from being generated.
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
374 #
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
375 #Revision 1.1.2.2 2002/04/19 19:54:42 rochecompaan
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
376 #cgi_client.py
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
377 # removed search link for the time being
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
378 # moved rendering of matches to htmltemplate
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
379 #hyperdb.py
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
380 # filtering of nodes on full text search incorporated in filter method
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
381 #roundupdb.py
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
382 # added parameter to call of filter method
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
383 #roundup_indexer.py
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
384 # added search method to RoundupIndexer class
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
385 #
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
386 #Revision 1.1.2.1 2002/04/03 11:55:57 rochecompaan
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
387 # . Added feature #526730 - search for messages capability
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
388 #
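
The matching rule in find() in the listing above (every indexable word must be present, and words outside the 2 < len(word) < 25 bounds are skipped) can be sketched in isolation. This is a minimal sketch in Python 3, not the module's code: the function name find_all and the sample words and fileids dictionaries are hypothetical, chosen only to mirror the shape of self.words and self.fileids.

```python
def find_all(words, fileids, wordlist):
    """Return {fileid: identifier} for entries that contain ALL indexable words.

    words and fileids are hypothetical stand-ins for self.words and
    self.fileids in the listing.
    """
    hits = None
    for word in wordlist:
        if not 2 < len(word) < 25:
            # word outside the bounds of what is indexed: ignore it
            continue
        entry = words.get(word.upper())
        if not entry:
            # one indexable word has no matches, so the AND search fails
            return {}
        if hits is None:
            # first indexable word: start from all of its matches
            hits = {k: fileids[k] for k in entry}
        else:
            # later words: drop every hit that this word does not match
            # (list() copies the keys so deleting while looping is safe)
            for fileid in list(hits):
                if fileid not in entry:
                    del hits[fileid]
    if hits is None:
        # no indexable words were supplied at all
        return {}
    return hits


# Hypothetical sample index: word -> {fileid: occurrence count}
words = {'HELLO': {1: 2, 2: 1}, 'WORLD': {2: 3}}
fileids = {1: ('issue', '1', 'title'), 2: ('msg', '7', 'content')}

both = find_all(words, fileids, ['hello', 'world'])
short_word_ignored = find_all(words, fileids, ['hello', 'zz'])
no_match = find_all(words, fileids, ['hello', 'missing'])
```

Here both holds only fileid 2, short_word_ignored holds fileids 1 and 2 because 'zz' is too short to count, and no_match is empty.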

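
The on-disk layout used by save_index() and load_index() in the listing (one zlib-compressed marshal dump per initial character, appended to the index path) can be illustrated with a small round trip. This is a sketch in Python 3 under stated assumptions, not the module's implementation: the helper names save_segment, load_segment and split_by_initial are hypothetical, the index path is a temporary directory, and setdefault is used in place of the listing's pre-built dictionary per letter.

```python
import marshal
import os
import tempfile
import zlib


def save_segment(path, db):
    """Write one segment as a zlib-compressed marshal dump."""
    with open(path, 'wb') as fh:
        fh.write(zlib.compress(marshal.dumps(db)))


def load_segment(path):
    """Read one segment back; return None if the segment file is missing."""
    try:
        with open(path, 'rb') as fh:
            return marshal.loads(zlib.decompress(fh.read()))
    except FileNotFoundError:
        # Python 3 spelling of the errno.ENOENT check in the listing
        return None


def split_by_initial(words):
    """Group the word index by the upper-cased first character of each word."""
    segdicts = {}
    for word, entry in words.items():
        segdicts.setdefault(word[0].upper(), {})[word] = entry
    return segdicts


# Hypothetical index path and sample word index
indexdb = os.path.join(tempfile.mkdtemp(), 'index.db')
words = {'HELLO': {1: 2}, 'HELP': {2: 1}, 'WORLD': {2: 3}}

for initial, segment in split_by_initial(words).items():
    save_segment(indexdb + initial,
                 {'WORDS': segment, 'FILES': None, 'FILEIDS': None})

restored = load_segment(indexdb + 'H')
missing = load_segment(indexdb + 'Z')
```

Splitting by initial character is what lets a search load only the segments for the first letters of the query words, rather than the whole word index.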
Roundup Issue Tracker: http://roundup-tracker.org/