annotate roundup/indexer.py @ 834:568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
This is used by indexer.search so it can do just one find for all the
index matches.
This was already confusing code, but for common terms (lots of index matches),
it is enormously faster.
| author | Gordon B. McMillan <gmcm@users.sourceforge.net> |
|---|---|
| date | Tue, 09 Jul 2002 21:53:38 +0000 |
| parents | b80aaedba3db |
| children | ba38e1e718f2 |
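The changeset message above describes letting a `find` propspec carry a set of ids, so that one call can cover every index hit. A minimal sketch of that idea follows. It is not roundup's hyperdb implementation: the `find` function, the `issues` dictionary and its layout are hypothetical stand-ins, and the code is written for a current Python rather than the Python 2 of the annotated file.

```python
# Toy stand-in for a hyperdb class: maps nodeid -> {propname: [linked ids]}.
# Every name here is hypothetical and exists only for illustration.
def find(nodes, **propspec):
    """Return the ids of nodes that link to any of the wanted ids.

    Each propspec value is either a single id or a dict used as a set of ids.
    """
    matches = []
    for nodeid, props in sorted(nodes.items()):
        for propname, wanted in propspec.items():
            if not isinstance(wanted, dict):
                wanted = {wanted: 1}          # single id: wrap it as a set
            linked = props.get(propname, [])
            if any(linkid in wanted for linkid in linked):
                matches.append(nodeid)
                break                         # one match per node is enough
    return matches


issues = {
    '1': {'messages': ['10', '11'], 'files': []},
    '2': {'messages': ['12'], 'files': ['20']},
    '3': {'messages': [], 'files': []},
}

# One call covers every index hit, instead of one find() per message or file id.
result = find(issues, messages={'10': 1, '12': 1}, files={'20': 1})
```

With many index hits, the single pass over the nodes replaces one pass per hit, which is where the speed-up for common terms would come from.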
#
# This module is derived from the module described at:
#   http://gnosis.cx/publish/programming/charming_python_15.txt
#
# Author: David Mertz (mertz@gnosis.cx)
# Thanks to: Pat Knight (p.knight@ktgroup.co.uk)
#            Gregory Popovitch (greg@gpy.com)
#
# The original module was released under this license, and remains under
# it:
#
#       This file is released to the public domain.  I (dqm) would
#       appreciate it if you choose to keep derived works under terms
#       that promote freedom, but obviously am giving up any rights
#       to compel such.
#
#$Id: indexer.py,v 1.8 2002-07-09 21:53:38 gmcm Exp $
'''
This module provides an indexer class, Indexer, that stores text
indices in a roundup instance.  This class makes searching the content of
messages, string properties and text files possible.
'''
import os, shutil, re, mimetypes, marshal, zlib, errno
from hyperdb import Link, Multilink

class Indexer:
    ''' Indexes information from roundup's hyperdb to allow efficient
        searching.

        Three structures are created by the indexer:
          files   {identifier: (fileid, wordcount)}
          words   {word: {fileid: count}}
          fileids {fileid: identifier}
        where identifier is (classname, nodeid, propertyname)
    '''
    def __init__(self, db_path):
        self.indexdb_path = os.path.join(db_path, 'indexes')
        self.indexdb = os.path.join(self.indexdb_path, 'index.db')
        self.reindex = 0
        self.quiet = 9
        self.changed = 0

        # see if we need to reindex because of a change in code
        if (not os.path.exists(self.indexdb_path) or
                not os.path.exists(os.path.join(self.indexdb_path, 'version'))):
            # TODO: if the version file exists (in the future) we'll want to
            # check the value in it - for now the file itself is a flag
            self.force_reindex()

    def force_reindex(self):
        '''Force a reindex condition
        '''
        if os.path.exists(self.indexdb_path):
            shutil.rmtree(self.indexdb_path)
        os.makedirs(self.indexdb_path)
        os.chmod(self.indexdb_path, 0775)
        open(os.path.join(self.indexdb_path, 'version'), 'w').write('1\n')
        self.reindex = 1
        self.changed = 1

    def should_reindex(self):
        '''Should we reindex?
        '''
        return self.reindex

    def add_text(self, identifier, text, mime_type='text/plain'):
        ''' Add some text associated with the (classname, nodeid, property)
            identifier.
        '''
        # make sure the index is loaded
        self.load_index()

        # remove old entries for this identifier
        if self.files.has_key(identifier):
            self.purge_entry(identifier)

        # split into words
        words = self.splitter(text, mime_type)

        # Find new file index, and assign it to identifier
        # (_TOP uses trick of negative to avoid conflict with file index)
        self.files['_TOP'] = (self.files['_TOP'][0]-1, None)
        file_index = abs(self.files['_TOP'][0])
        self.files[identifier] = (file_index, len(words))
        self.fileids[file_index] = identifier

        # find the unique words
        filedict = {}
        for word in words:
            if filedict.has_key(word):
                filedict[word] = filedict[word]+1
            else:
                filedict[word] = 1

        # now add to the totals
        for word in filedict.keys():
            # each word has a dict of {fileid: count}
            if self.words.has_key(word):
                entry = self.words[word]
            else:
                # new word
                entry = {}
                self.words[word] = entry

            # make a reference to the file for this word
            entry[file_index] = filedict[word]

        # save needed
        self.changed = 1

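The bookkeeping in `add_text` is easier to follow on concrete data. The sketch below repeats the same steps on plain module-level dictionaries, under a few assumptions: it targets a current Python, it inlines the word-splitting regular expression from `text_splitter`, and it leaves out index loading and the purge of old entries. The standalone `add_text` function is an illustration, not the method itself.

```python
import re

files = {'_TOP': (0, None)}   # identifier -> (fileid, wordcount)
words = {}                    # word -> {fileid: count}
fileids = {}                  # fileid -> identifier


def add_text(identifier, text):
    """Index text under identifier, mirroring the method's bookkeeping."""
    tokens = re.findall(r'\b\w{2,25}\b', text.upper())
    # _TOP counts downwards; its absolute value is the newest file id
    files['_TOP'] = (files['_TOP'][0] - 1, None)
    file_index = abs(files['_TOP'][0])
    files[identifier] = (file_index, len(tokens))
    fileids[file_index] = identifier
    # record how often each word occurs in this file
    for token in tokens:
        entry = words.setdefault(token, {})
        entry[file_index] = entry.get(file_index, 0) + 1


add_text(('msg', '1', 'content'), 'the index is the index')
add_text(('issue', '7', 'title'), 'index rebuild')
```

After the two calls, `words['INDEX']` maps file id 1 to a count of 2 and file id 2 to a count of 1, and `fileids` maps each file id back to its `(classname, nodeid, propertyname)` identifier.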
    def splitter(self, text, ftype):
        ''' Split the contents of a text string into a list of 'words'
        '''
        if ftype == 'text/plain':
            words = self.text_splitter(text)
        else:
            return []
        return words

    def text_splitter(self, text):
        """Split text/plain string into a list of words
        """
        # case insensitive
        text = text.upper()

        # Split the raw text, losing anything longer than 25 characters
        # since that'll be gibberish (encoded text or somesuch) or shorter
        # than 2 characters since those short words appear all over the
        # place
        return re.findall(r'\b\w{2,25}\b', text)

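The effect of the word-splitting pattern can be checked on a small sample. This is a sketch for a current Python; the module-level `text_splitter` function and the sample string are made up for illustration, while the regular expression is the one used in the method.

```python
import re


def text_splitter(text):
    # Upper-case for case-insensitive matching, then keep whole words of
    # 2 to 25 word characters; longer and shorter runs match nothing.
    return re.findall(r'\b\w{2,25}\b', text.upper())


# A one-letter word, three ordinary words and a 30-character run
sample = 'A bug in db_lookup: ' + 'x' * 30
tokens = text_splitter(sample)
# tokens == ['BUG', 'IN', 'DB_LOOKUP']
```

The `\b` anchors are what drop the long run entirely instead of cutting it into 25-character pieces, and the underscore survives because `\w` includes it.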
|
825
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
132 def search(self, search_terms, klass, ignore={}, |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
133 dre=re.compile(r'([^\d]+)(\d+)')): |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
134 ''' Display search results looking for [search, terms] associated |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
135 with the hyperdb Class "klass". Ignore hits on {class: property}. |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
136 |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
137 "dre" is a helper, not an argument. |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
138 ''' |
|
825
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
139 # do the index lookup |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
140 hits = self.find(search_terms) |
|
825
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
141 if not hits: |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
142 return {} |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
143 |
|
834
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
144 #designator_propname = {'msg': 'messages', 'file': 'files'} |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
145 designator_propname = {} |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
146 for nm, propclass in klass.getprops().items(): |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
147 if isinstance(propclass, Link) or isinstance(propclass, Multilink): |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
148 designator_propname[propclass.classname] = nm |
|
825
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
149 |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
150 # build a dictionary of nodes and their associated messages |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
151 # and files |
|
834
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
152 nodeids = {} # this is the answer |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
153 propspec = {} # used to do the klass.find |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
154 for propname in designator_propname.values(): |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
155 propspec[propname] = {} # used as a set (value doesn't matter) |
|
825
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
156 for classname, nodeid, property in hits.values(): |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
157 # skip this result if we don't care about this class/property |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
158 if ignore.has_key((classname, property)): |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
159 continue |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
160 |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
161 # if it's a property on klass, it's easy |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
162 if classname == klass.classname: |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
163 if not nodeids.has_key(nodeid): |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
164 nodeids[nodeid] = {} |
|
0779ea9f1f18
More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents:
818
diff
changeset
|
165 continue |
|
749
51c425129b35
Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff
changeset
|
166 |
|
834
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
167 # it's a linked class - set up to do the klass.find |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
168 linkprop = designator_propname[classname] # eg, msg -> messages |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
169 propspec[linkprop][nodeid] = 1 |
|
818
254b8d112eec
cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents:
749
diff
changeset
|
170 |
|
834
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
171 # retain only the meaningful entries |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
172 for propname, idset in propspec.items(): |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
173 if not idset: |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
174 del propspec[propname] |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
175 |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
176 # klass.find tells me the klass nodeids the linked nodes relate to |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
177 for resid in klass.find(**propspec): |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
178 resid = str(resid) |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
179 if not nodeids.has_key(id): |
|
568eed5fb4fd
Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents:
833
diff
changeset
|
180 nodeids[resid] = {} |
| 834 | 181             node_dict = nodeids[resid] |
| 834 | 182             # now figure out where it came from |
| 834 | 183             for linkprop in propspec.keys(): |
| 834 | 184                 for nodeid in klass.get(resid, linkprop): |
| 834 | 185                     if propspec[linkprop].has_key(nodeid): |
| 834 | 186                         # OK, this node[propname] has a winner |
| 834 | 187                         if not node_dict.has_key(linkprop): |
| 834 | 188                             node_dict[linkprop] = [nodeid] |
| 834 | 189                         else: |
| 834 | 190                             node_dict[linkprop].append(nodeid) |
| 818 | 191         return nodeids |
| 818 | 192 |
| 818 | 193     # we override this to ignore not 2 < word < 25 and also to fix a bug - |
| 818 | 194     # the (fail) case. |
| 818 | 195     def find(self, wordlist): |
| 818 | 196         ''' Locate files that match ALL the words in wordlist |
| 818 | 197         ''' |
| 818 | 198         if not hasattr(self, 'words'): |
| 818 | 199             self.load_index() |
| 749 | 200         self.load_index(wordlist=wordlist) |
| 749 | 201         entries = {} |
| 818 | 202         hits = None |
| 749 | 203         for word in wordlist: |
| 818 | 204             if not 2 < len(word) < 25: |
| 818 | 205                 # word outside the bounds of what we index - ignore |
| 818 | 206                 continue |
| 833 | 207             word = word.upper() |
| 749 | 208             entry = self.words.get(word) # For each word, get index |
| 749 | 209             entries[word] = entry # of matching files |
| 749 | 210             if not entry: # Nothing for this one word (fail) |
| 818 | 211                 return {} |
| 818 | 212             if hits is None: |
| 818 | 213                 hits = {} |
| 818 | 214                 for k in entry.keys(): |
| 818 | 215                     hits[k] = self.fileids[k] |
| 818 | 216             else: |
| 818 | 217                 # Eliminate hits for every non-match |
| 818 | 218                 for fileid in hits.keys(): |
| 818 | 219                     if not entry.has_key(fileid): |
| 818 | 220                         del hits[fileid] |
| 818 | 221         if hits is None: |
| 818 | 222             return {} |
| 749 | 223         return hits |
| 749 | 224 |
| 827 | 225     segments = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_-!" |
| 818 | 226     def load_index(self, reload=0, wordlist=None): |
| 818 | 227         # Unless reload is indicated, do not load twice |
| 818 | 228         if self.index_loaded() and not reload: |
| 818 | 229             return 0 |
| 818 | 230 |
| 818 | 231         # Ok, now let's actually load it |
| 818 | 232         db = {'WORDS': {}, 'FILES': {'_TOP':(0,None)}, 'FILEIDS': {}} |
| 818 | 233 |
| 818 | 234         # Identify the relevant word-dictionary segments |
| 818 | 235         if not wordlist: |
| 818 | 236             segments = self.segments |
| 818 | 237         else: |
| 818 | 238             segments = ['-','#'] |
| 818 | 239             for word in wordlist: |
| 818 | 240                 segments.append(word[0].upper()) |
| 818 | 241 |
| 818 | 242         # Load the segments |
| 818 | 243         for segment in segments: |
| 818 | 244             try: |
| 818 | 245                 f = open(self.indexdb + segment, 'rb') |
| 818 | 246             except IOError, error: |
| 818 | 247                 if error.errno != errno.ENOENT: |
| 818 | 248                     raise |
| 749 | 249             else: |
| 818 | 250                 pickle_str = zlib.decompress(f.read()) |
| 818 | 251                 f.close() |
| 818 | 252                 dbslice = marshal.loads(pickle_str) |
| 818 | 253                 if dbslice.get('WORDS'): |
| 818 | 254                     # if it has some words, add them |
| 818 | 255                     for word, entry in dbslice['WORDS'].items(): |
| 818 | 256                         db['WORDS'][word] = entry |
| 818 | 257                 if dbslice.get('FILES'): |
| 818 | 258                     # if it has some files, add them |
| 818 | 259                     db['FILES'] = dbslice['FILES'] |
| 818 | 260                 if dbslice.get('FILEIDS'): |
| 818 | 261                     # if it has fileids, add them |
| 818 | 262                     db['FILEIDS'] = dbslice['FILEIDS'] |
| 818 | 263 |
| 818 | 264         self.words = db['WORDS'] |
| 818 | 265         self.files = db['FILES'] |
| 818 | 266         self.fileids = db['FILEIDS'] |
| 833 | 267         self.changed = 0 |
| 749 | 268 |
| 818 | 269     def save_index(self): |
| 833 | 270         # only save if the index is loaded and changed |
| 833 | 271         if not self.index_loaded() or not self.changed: |
| 833 | 272             return |
| 825 | 273 |
| 818 | 274         # brutal space saver... delete all the small segments |
| 818 | 275         for segment in self.segments: |
| 818 | 276             try: |
| 818 | 277                 os.remove(self.indexdb + segment) |
| 818 | 278             except OSError: |
| 818 | 279                 # probably just nonexistent segment index file |
| 818 | 280                 # TODO: make sure it's an EEXIST |
| 818 | 281                 pass |
| 818 | 282 |
| 818 | 283         # First write the much simpler filename/fileid dictionaries |
| 818 | 284         dbfil = {'WORDS':None, 'FILES':self.files, 'FILEIDS':self.fileids} |
| 818 | 285         open(self.indexdb+'-','wb').write(zlib.compress(marshal.dumps(dbfil))) |
| 818 | 286 |
| 818 | 287         # The hard part is splitting the word dictionary up, of course |
| 827 | 288         letters = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_" |
| 818 | 289         segdicts = {} # Need batch of empty dicts |
| 818 | 290         for segment in letters: |
| 818 | 291             segdicts[segment] = {} |
| 818 | 292         for word, entry in self.words.items(): # Split into segment dicts |
| 818 | 293             initchar = word[0].upper() |
| 818 | 294             segdicts[initchar][word] = entry |
| 818 | 295 |
| 818 | 296         # save |
| 818 | 297         for initchar in letters: |
| 818 | 298             db = {'WORDS':segdicts[initchar], 'FILES':None, 'FILEIDS':None} |
| 818 | 299             pickle_str = marshal.dumps(db) |
| 818 | 300             filename = self.indexdb + initchar |
| 818 | 301             pickle_fh = open(filename, 'wb') |
| 818 | 302             pickle_fh.write(zlib.compress(pickle_str)) |
| 818 | 303             os.chmod(filename, 0664) |
| 749 | 304 |
| 833 | 305         # save done |
| 833 | 306         self.changed = 0 |
| 833 | 307 |
| 826 | 308     def purge_entry(self, identifier): |
| 818 | 309         ''' Remove a file from file index and word index |
| 818 | 310         ''' |
| 826 | 311         if not self.files.has_key(identifier): |
| 826 | 312             return |
| 826 | 313 |
| 826 | 314         file_index = self.files[identifier][0] |
| 826 | 315         del self.files[identifier] |
| 826 | 316         del self.fileids[file_index] |
| 826 | 317 |
| 749 | 318         # The much harder part, cleanup the word index |
| 826 | 319         for key, occurs in self.words.items(): |
| 749 | 320             if occurs.has_key(file_index): |
| 749 | 321                 del occurs[file_index] |
| 749 | 322 |
| 833 | 323         # save needed |
| 833 | 324         self.changed = 1 |
| 833 | 325 |
| 749 | 326     def index_loaded(self): |
| 818 | 327         return (hasattr(self,'fileids') and hasattr(self,'files') and |
| 818 | 328             hasattr(self,'words')) |
| 749 | 329 |
| 749 | 330 # |
| 749 | 331 #$Log: not supported by cvs2svn $ |
| 834 | 332 #Revision 1.7 2002/07/09 21:38:43 richard |
| 834 | 333 #Only save the index if the thing is loaded and changed. Also, don't load |
| 834 | 334 #the index just for a save. |
| 834 | 335 # |
| 833 | 336 #Revision 1.6 2002/07/09 04:26:44 richard |
| 833 | 337 #We're indexing numbers now, and _underscore words |
| 833 | 338 # |
| 827 | 339 #Revision 1.5 2002/07/09 04:19:09 richard |
| 827 | 340 #Added reindex command to roundup-admin. |
| 827 | 341 #Fixed reindex on first access. |
| 827 | 342 #Also fixed reindexing of entries that change. |
| 827 | 343 # |
| 826 | 344 #Revision 1.4 2002/07/09 03:02:52 richard |
| 826 | 345 #More indexer work: |
| 826 | 346 #- all String properties may now be indexed too. Currently there's a bit of |
| 826 | 347 # "issue" specific code in the actual searching which needs to be |
| 826 | 348 # addressed. In a nutshell: |
| 826 | 349 # + pass 'indexme="yes"' as a String() property initialisation arg, eg: |
| 826 | 350 # file = FileClass(db, "file", name=String(), type=String(), |
| 826 | 351 # comment=String(indexme="yes")) |
| 826 | 352 # + the comment will then be indexed and be searchable, with the results |
| 826 | 353 # related back to the issue that the file is linked to |
| 826 | 354 #- as a result of this work, the FileClass has a default MIME type that may |
| 826 | 355 # be overridden in a subclass, or by the use of a "type" property as is |
| 826 | 356 # done in the default templates. |
| 826 | 357 #- the regeneration of the indexes (if necessary) is done once the schema is |
| 826 | 358 # set up in the dbinit. |
| 826 | 359 # |
| 825 | 360 #Revision 1.3 2002/07/08 06:58:15 richard |
| 825 | 361 #cleaned up the indexer code: |
| 825 | 362 # - it splits more words out (much simpler, faster splitter) |
| 825 | 363 # - removed code we'll never use (roundup.roundup_indexer has the full |
| 825 | 364 # implementation, and replaces roundup.indexer) |
| 825 | 365 # - only index text/plain and rfc822/message (ideas for other text formats to |
| 825 | 366 # index are welcome) |
| 825 | 367 # - added simple unit test for indexer. Needs more tests for regression. |
| 825 | 368 # |
| 818 | 369 #Revision 1.2 2002/05/25 07:16:24 rochecompaan |
| 818 | 370 #Merged search_indexing-branch with HEAD |
| 818 | 371 # |
| 818 | 372 #Revision 1.1.2.3 2002/05/02 11:52:12 rochecompaan |
| 818 | 373 #Fixed small bug that prevented indexes from being generated. |
| 749 | 374 # |
| 818 | 375 #Revision 1.1.2.2 2002/04/19 19:54:42 rochecompaan |
| 818 | 376 #cgi_client.py |
| 818 | 377 # removed search link for the time being |
| 818 | 378 # moved rendering of matches to htmltemplate |
| 818 | 379 #hyperdb.py |
| 818 | 380 # filtering of nodes on full text search incorporated in filter method |
| 818 | 381 #roundupdb.py |
| 818 | 382 # added parameter to call of filter method |
| 818 | 383 #roundup_indexer.py |
| 818 | 384 # added search method to RoundupIndexer class |
| 749 | 385 # |
| 818 | 386 #Revision 1.1.2.1 2002/04/03 11:55:57 rochecompaan |
| 818 | 387 # . Added feature #526730 - search for messages capability |
| 818 | 388 # |
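
The listing above splits the word index into one compressed `marshal` file per first character, loads only the segments a query needs, and intersects the per-word hits in `find`. The sketch below illustrates that idea in Python 3; it is not the module's code. The standalone functions `save_index`, `load_index` and `find`, the `SEGMENTS` constant and the sample data are assumptions made for illustration, and the `FILES` dictionary and the `#` segment that the listing always loads are left out for brevity.

```python
import marshal
import os
import tempfile
import zlib

# Assumed segment alphabet, mirroring the `letters` string in the listing.
SEGMENTS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_"


def save_index(prefix, words, fileids):
    """Write file ids to '<prefix>-' and words to one file per first character."""
    with open(prefix + "-", "wb") as fh:
        fh.write(zlib.compress(marshal.dumps({"WORDS": None, "FILEIDS": fileids})))
    segdicts = {ch: {} for ch in SEGMENTS}
    for word, entry in words.items():
        # Assumes every word starts with a character in SEGMENTS.
        segdicts[word[0].upper()][word] = entry
    for ch, segment in segdicts.items():
        with open(prefix + ch, "wb") as fh:
            fh.write(zlib.compress(marshal.dumps({"WORDS": segment, "FILEIDS": None})))


def load_index(prefix, wordlist):
    """Load the file-id segment plus only the word segments the query needs."""
    words, fileids = {}, {}
    for ch in ["-"] + sorted({w[0].upper() for w in wordlist}):
        try:
            with open(prefix + ch, "rb") as fh:
                dbslice = marshal.loads(zlib.decompress(fh.read()))
        except FileNotFoundError:
            continue  # a missing segment simply contributes nothing
        words.update(dbslice.get("WORDS") or {})
        fileids = dbslice.get("FILEIDS") or fileids
    return words, fileids


def find(words, fileids, wordlist):
    """Return {fileid: identifier} for files containing ALL usable words."""
    hits = None
    for word in wordlist:
        if not 2 < len(word) < 25:
            continue  # outside the indexed length range, so ignored
        entry = words.get(word.upper())
        if not entry:
            return {}  # one unmatched word fails the whole query
        if hits is None:
            hits = {k: fileids[k] for k in entry}
        else:
            hits = {k: v for k, v in hits.items() if k in entry}
    return hits or {}


if __name__ == "__main__":
    prefix = os.path.join(tempfile.mkdtemp(), "index.db")
    sample_words = {"HELLO": {1: 1, 2: 1}, "WORLD": {1: 1}}
    sample_ids = {1: ("msg", "1", "content"), 2: ("msg", "2", "content")}
    save_index(prefix, sample_words, sample_ids)
    loaded_words, loaded_ids = load_index(prefix, ["hello", "world"])
    print(find(loaded_words, loaded_ids, ["hello", "world"]))
```

Building a fresh dictionary on each pass, instead of deleting from `hits` while iterating over `hits.keys()` as the listing does, is a deliberate change for Python 3, where `keys()` returns a live view and deleting during iteration raises an error.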
