annotate roundup/indexer.py @ 2077:3e0961d6d44d

Added the "actor" property. Metakit backend not done (still not confident I know how it's supposed to work ;) Currently it will come up as NULL in the RDBMS backends for older items. The *dbm backends will look up the journal. I hope to remedy the former before 0.7's release. Fixed a bunch of migration issues in the rdbms backends while I was at it (index changes for key prop changes) and simplified the class table update code for RDBMSes that have "alter table" in their command set (ie. not sqlite) ... migration from "version 1" to "version 2" still hasn't actually been tested yet though.
author Richard Jones <richard@users.sourceforge.net>
date Mon, 15 Mar 2004 05:50:20 +0000
parents fc52d57c6c3e
children
rev   line source
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
1 #
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
2 # This module is derived from the module described at:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
3 # http://gnosis.cx/publish/programming/charming_python_15.txt
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
4 #
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
5 # Author: David Mertz (mertz@gnosis.cx)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
6 # Thanks to: Pat Knight (p.knight@ktgroup.co.uk)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
7 # Gregory Popovitch (greg@gpy.com)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
8 #
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
9 # The original module was released under this license, and remains under
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
10 # it:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
11 #
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
12 # This file is released to the public domain. I (dqm) would
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
13 # appreciate it if you choose to keep derived works under terms
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
14 # that promote freedom, but obviously am giving up any rights
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
15 # to compel such.
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
16 #
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
17 #$Id: indexer.py,v 1.18 2004-02-11 23:55:08 richard Exp $
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
18 '''This module provides an indexer class, RoundupIndexer, that stores text
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
19 indices in a roundup instance. This class makes searching the content of
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
20 messages, string properties and text files possible.
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
21 '''
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
22 __docformat__ = 'restructuredtext'
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
23
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
24 import os, shutil, re, mimetypes, marshal, zlib, errno
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
25 from hyperdb import Link, Multilink
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
26
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
27 class Indexer:
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
28 '''Indexes information from roundup's hyperdb to allow efficient
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
29 searching.
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
30
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
31 Three structures are created by the indexer::
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
32
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
33 files {identifier: (fileid, wordcount)}
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
34 words {word: {fileid: count}}
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
35 fileids {fileid: identifier}
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
36
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
37 where identifier is (classname, nodeid, propertyname)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
38 '''
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
39 def __init__(self, db_path):
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
40 self.indexdb_path = os.path.join(db_path, 'indexes')
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
41 self.indexdb = os.path.join(self.indexdb_path, 'index.db')
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
42 self.reindex = 0
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
43 self.quiet = 9
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
44 self.changed = 0
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
45
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
46 # see if we need to reindex because of a change in code
863
ba38e1e718f2 Some TODOs
Richard Jones <richard@users.sourceforge.net>
parents: 834
diff changeset
47 version = os.path.join(self.indexdb_path, 'version')
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
48 if (not os.path.exists(self.indexdb_path) or
863
ba38e1e718f2 Some TODOs
Richard Jones <richard@users.sourceforge.net>
parents: 834
diff changeset
49 not os.path.exists(version)):
ba38e1e718f2 Some TODOs
Richard Jones <richard@users.sourceforge.net>
parents: 834
diff changeset
50 # for now the file itself is a flag
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
51 self.force_reindex()
863
ba38e1e718f2 Some TODOs
Richard Jones <richard@users.sourceforge.net>
parents: 834
diff changeset
52 elif os.path.exists(version):
ba38e1e718f2 Some TODOs
Richard Jones <richard@users.sourceforge.net>
parents: 834
diff changeset
53 version = open(version).read()
ba38e1e718f2 Some TODOs
Richard Jones <richard@users.sourceforge.net>
parents: 834
diff changeset
54 # check the value and reindex if it's not the latest
880
de3da99a7c02 Add Number and Boolean types to hyperdb.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 867
diff changeset
55 if version.strip() != '1':
863
ba38e1e718f2 Some TODOs
Richard Jones <richard@users.sourceforge.net>
parents: 834
diff changeset
56 self.force_reindex()
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
57
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
58 def force_reindex(self):
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
59 '''Force a reindex condition
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
60 '''
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
61 if os.path.exists(self.indexdb_path):
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
62 shutil.rmtree(self.indexdb_path)
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
63 os.makedirs(self.indexdb_path)
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
64 os.chmod(self.indexdb_path, 0775)
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
65 open(os.path.join(self.indexdb_path, 'version'), 'w').write('1\n')
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
66 self.reindex = 1
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
67 self.changed = 1
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
68
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
69 def should_reindex(self):
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
70 '''Should we reindex?
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
71 '''
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
72 return self.reindex
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
73
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
74 def add_text(self, identifier, text, mime_type='text/plain'):
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
75 '''Add some text associated with the (classname, nodeid, property)
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
76 identifier.
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
77 '''
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
78 # make sure the index is loaded
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
79 self.load_index()
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
80
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
81 # remove old entries for this identifier
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
82 if self.files.has_key(identifier):
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
83 self.purge_entry(identifier)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
84
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
85 # split into words
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
86 words = self.splitter(text, mime_type)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
87
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
88 # Find new file index, and assign it to identifier
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
89 # (_TOP uses trick of negative to avoid conflict with file index)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
90 self.files['_TOP'] = (self.files['_TOP'][0]-1, None)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
91 file_index = abs(self.files['_TOP'][0])
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
92 self.files[identifier] = (file_index, len(words))
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
93 self.fileids[file_index] = identifier
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
94
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
95 # find the unique words
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
96 filedict = {}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
97 for word in words:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
98 if filedict.has_key(word):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
99 filedict[word] = filedict[word]+1
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
100 else:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
101 filedict[word] = 1
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
102
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
103 # now add to the totals
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
104 for word in filedict.keys():
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
105 # each word has a dict of {fileid: count}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
106 if self.words.has_key(word):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
107 entry = self.words[word]
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
108 else:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
109 # new word
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
110 entry = {}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
111 self.words[word] = entry
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
112
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
113 # make a reference to the file for this word
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
114 entry[file_index] = filedict[word]
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
115
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
116 # save needed
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
117 self.changed = 1
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
118
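As a rough illustration of what add_text leaves behind, the standalone Python 3 sketch below mirrors the three dictionaries named in the class docstring (files, words, fileids). It is not Roundup's code: the names new_index and add_text_sketch, the sample identifiers, and the initial '_TOP' value of (0, None) are assumptions made for the example, and purging of old entries and saving to disk are left out.

```python
import re

def new_index():
    # Assumption: '_TOP' starts at (0, None); it holds a negative counter
    # so that it cannot be mistaken for a real (positive) file id.
    return {'files': {'_TOP': (0, None)}, 'words': {}, 'fileids': {}}

def add_text_sketch(index, identifier, text):
    """Index text under identifier = (classname, nodeid, propertyname)."""
    words = re.findall(r'\b\w{2,25}\b', text.upper())

    # allocate the next file id by decrementing the '_TOP' counter
    top = index['files']['_TOP'][0] - 1
    index['files']['_TOP'] = (top, None)
    file_index = abs(top)
    index['files'][identifier] = (file_index, len(words))
    index['fileids'][file_index] = identifier

    # count each word in this text, then merge the counts into
    # words: {word: {fileid: count}}
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    for word, count in counts.items():
        index['words'].setdefault(word, {})[file_index] = count
    return file_index

index = new_index()
add_text_sketch(index, ('msg', '1', 'content'), 'the index and the indexer')
add_text_sketch(index, ('msg', '2', 'content'), 'rebuild the index')
```

After the two calls, index['words']['THE'] maps file id 1 to a count of 2 and file id 2 to a count of 1, which is the shape a search needs to intersect hits across several terms.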
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
119 def splitter(self, text, ftype):
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
120 '''Split the contents of a text string into a list of 'words'
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
121 '''
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
122 if ftype == 'text/plain':
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
123 words = self.text_splitter(text)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
124 else:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
125 return []
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
126 return words
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
127
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
128 def text_splitter(self, text):
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
129 """Split text/plain string into a list of words
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
130 """
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
131 # case insensitive
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
132 text = text.upper()
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
133
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
134 # Split the raw text, losing anything longer than 25 characters
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
135 # since that'll be gibberish (encoded text or somesuch) or shorter
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
136 # than 2 characters since those short words appear all over the
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
137 # place
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
138 return re.findall(r'\b\w{2,25}\b', text)
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
139
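To see what the regular expression in text_splitter keeps and drops, here is a small standalone Python 3 sketch of the same two steps (upper-casing, then re.findall). The sample string is made up for the example.

```python
import re

def text_splitter(text):
    # case insensitive: everything is indexed in upper case
    text = text.upper()
    # keep runs of 2 to 25 word characters that sit between word boundaries;
    # a longer run never matches because \b cannot occur inside it
    return re.findall(r'\b\w{2,25}\b', text)

words = text_splitter('A bug in db.py: ' + 'x' * 30)
print(words)  # ['BUG', 'IN', 'DB', 'PY']
```

The single-letter word and the 30-character run are both dropped, while two-letter words such as IN are kept by this pattern.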
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
140 def search(self, search_terms, klass, ignore={},
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
141 dre=re.compile(r'([^\d]+)(\d+)')):
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
142 '''Display search results looking for [search, terms] associated
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
143 with the hyperdb Class "klass". Ignore hits on {class: property}.
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
144
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
145 "dre" is a helper, not an argument.
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
146 '''
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
147 # do the index lookup
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
148 hits = self.find(search_terms)
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
149 if not hits:
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
150 return {}
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
151
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
152 designator_propname = {}
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
153 for nm, propclass in klass.getprops().items():
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
154 if isinstance(propclass, Link) or isinstance(propclass, Multilink):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
155 designator_propname[propclass.classname] = nm
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
156
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
157 # build a dictionary of nodes and their associated messages
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
158 # and files
1986
910b39f8c5b8 use the upload-supplied content-type if there is one
Richard Jones <richard@users.sourceforge.net>
parents: 1376
diff changeset
159 nodeids = {} # this is the answer
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
160 propspec = {} # used to do the klass.find
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
161 for propname in designator_propname.values():
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
162 propspec[propname] = {} # used as a set (value doesn't matter)
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
163 for classname, nodeid, property in hits.values():
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
164 # skip this result if we don't care about this class/property
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
165 if ignore.has_key((classname, property)):
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
166 continue
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
167
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
168 # if it's a property on klass, it's easy
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
169 if classname == klass.classname:
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
170 if not nodeids.has_key(nodeid):
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
171 nodeids[nodeid] = {}
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
172 continue
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
173
1206
728a0809183e handle multiple unrelated indexed classes
Richard Jones <richard@users.sourceforge.net>
parents: 1090
diff changeset
174 # make sure the class is a linked one, otherwise ignore
728a0809183e handle multiple unrelated indexed classes
Richard Jones <richard@users.sourceforge.net>
parents: 1090
diff changeset
175 if not designator_propname.has_key(classname):
728a0809183e handle multiple unrelated indexed classes
Richard Jones <richard@users.sourceforge.net>
parents: 1090
diff changeset
176 continue
728a0809183e handle multiple unrelated indexed classes
Richard Jones <richard@users.sourceforge.net>
parents: 1090
diff changeset
177
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
178 # it's a linked class - set up to do the klass.find
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
179 linkprop = designator_propname[classname] # eg, msg -> messages
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
180 propspec[linkprop][nodeid] = 1
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
181
834
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
182 # retain only the meaningful entries
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
183 for propname, idset in propspec.items():
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
184 if not idset:
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
185 del propspec[propname]
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
186
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
187 # klass.find tells me the klass nodeids the linked nodes relate to
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
188 for resid in klass.find(**propspec):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
189 resid = str(resid)
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
190 if not nodeids.has_key(resid):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
191 nodeids[resid] = {}
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
192 node_dict = nodeids[resid]
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
193 # now figure out where it came from
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
194 for linkprop in propspec.keys():
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
195 for nodeid in klass.get(resid, linkprop):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
196 if propspec[linkprop].has_key(nodeid):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
197 # OK, this node[propname] has a winner
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
198 if not node_dict.has_key(linkprop):
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
199 node_dict[linkprop] = [nodeid]
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
200 else:
568eed5fb4fd Optimize Class.find so that the propspec can contain a set of ids to match.
Gordon B. McMillan <gmcm@users.sourceforge.net>
parents: 833
diff changeset
201 node_dict[linkprop].append(nodeid)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
202 return nodeids
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
203
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
204 # we override this to ignore words outside 2 < len(word) < 25 and also to fix a bug -
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
205 # the (fail) case.
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
206 def find(self, wordlist):
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
207 '''Locate files that match ALL the words in wordlist
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
208 '''
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
209 if not hasattr(self, 'words'):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
210 self.load_index()
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
211 self.load_index(wordlist=wordlist)
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
212 entries = {}
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
213 hits = None
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
214 for word in wordlist:
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
215 if not 2 < len(word) < 25:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
216 # word outside the bounds of what we index - ignore
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
217 continue
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
218 word = word.upper()
1376
0c736e2f1dd5 .get() was intentional after all
Richard Jones <richard@users.sourceforge.net>
parents: 1365
diff changeset
219 entry = self.words.get(word) # For each word, get index
0c736e2f1dd5 .get() was intentional after all
Richard Jones <richard@users.sourceforge.net>
parents: 1365
diff changeset
220 entries[word] = entry # of matching files
0c736e2f1dd5 .get() was intentional after all
Richard Jones <richard@users.sourceforge.net>
parents: 1365
diff changeset
221 if not entry: # Nothing for this one word (fail)
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
222 return {}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
223 if hits is None:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
224 hits = {}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
225 for k in entry.keys():
1365
4884fb0860f9 fixed rdbms searching by ID [SF#666615]
Richard Jones <richard@users.sourceforge.net>
parents: 1206
diff changeset
226 if not self.fileids.has_key(k):
4884fb0860f9 fixed rdbms searching by ID [SF#666615]
Richard Jones <richard@users.sourceforge.net>
parents: 1206
diff changeset
227 raise ValueError, 'Index is corrupted: re-generate it'
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
228 hits[k] = self.fileids[k]
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
229 else:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
230 # Eliminate hits for every non-match
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
231 for fileid in hits.keys():
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
232 if not entry.has_key(fileid):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
233 del hits[fileid]
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
234 if hits is None:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
235 return {}
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
236 return hits
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
237
827
0a2c1f5e0e5a We're indexing numbers now, and _underscore words
Richard Jones <richard@users.sourceforge.net>
parents: 826
diff changeset
238 segments = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_-!"
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
239 def load_index(self, reload=0, wordlist=None):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
240 # Unless reload is indicated, do not load twice
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
241 if self.index_loaded() and not reload:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
242 return 0
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
243
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
244 # Ok, now let's actually load it
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
245 db = {'WORDS': {}, 'FILES': {'_TOP':(0,None)}, 'FILEIDS': {}}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
246
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
247 # Identify the relevant word-dictionary segments
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
248 if not wordlist:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
249 segments = self.segments
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
250 else:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
251 segments = ['-','#']
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
252 for word in wordlist:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
253 segments.append(word[0].upper())
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
254
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
255 # Load the segments
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
256 for segment in segments:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
257 try:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
258 f = open(self.indexdb + segment, 'rb')
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
259 except IOError, error:
863
ba38e1e718f2 Some TODOs
Richard Jones <richard@users.sourceforge.net>
parents: 834
diff changeset
260 # probably just nonexistent segment index file
ba38e1e718f2 Some TODOs
Richard Jones <richard@users.sourceforge.net>
parents: 834
diff changeset
261 if error.errno != errno.ENOENT: raise
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
262 else:
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
263 pickle_str = zlib.decompress(f.read())
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
264 f.close()
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
265 dbslice = marshal.loads(pickle_str)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
266 if dbslice.get('WORDS'):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
267 # if it has some words, add them
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
268 for word, entry in dbslice['WORDS'].items():
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
269 db['WORDS'][word] = entry
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
270 if dbslice.get('FILES'):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
271 # if it has some files, add them
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
272 db['FILES'] = dbslice['FILES']
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
273 if dbslice.get('FILEIDS'):
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
274 # if it has fileids, add them
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
275 db['FILEIDS'] = dbslice['FILEIDS']
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
276
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
277 self.words = db['WORDS']
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
278 self.files = db['FILES']
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
279 self.fileids = db['FILEIDS']
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
280 self.changed = 0
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
281
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
282 def save_index(self):
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
283 # only save if the index is loaded and changed
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
284 if not self.index_loaded() or not self.changed:
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
285 return
825
0779ea9f1f18 More indexer work:
Richard Jones <richard@users.sourceforge.net>
parents: 818
diff changeset
286
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
287 # brutal space saver... delete all the small segments
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
288 for segment in self.segments:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
289 try:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
290 os.remove(self.indexdb + segment)
863
ba38e1e718f2 Some TODOs
Richard Jones <richard@users.sourceforge.net>
parents: 834
diff changeset
291 except OSError, error:
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
292 # probably just nonexistent segment index file
867
Richard Jones <richard@users.sourceforge.net>
parents: 863
diff changeset
293 if error.errno != errno.ENOENT: raise
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
294
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
295 # First write the much simpler filename/fileid dictionaries
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
296 dbfil = {'WORDS':None, 'FILES':self.files, 'FILEIDS':self.fileids}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
297 open(self.indexdb+'-','wb').write(zlib.compress(marshal.dumps(dbfil)))
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
298
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
299 # The hard part is splitting the word dictionary up, of course
827
0a2c1f5e0e5a We're indexing numbers now, and _underscore words
Richard Jones <richard@users.sourceforge.net>
parents: 826
diff changeset
300 letters = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#_"
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
301 segdicts = {} # Need batch of empty dicts
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
302 for segment in letters:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
303 segdicts[segment] = {}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
304 for word, entry in self.words.items(): # Split into segment dicts
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
305 initchar = word[0].upper()
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
306 segdicts[initchar][word] = entry
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
307
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
308 # save
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
309 for initchar in letters:
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
310 db = {'WORDS':segdicts[initchar], 'FILES':None, 'FILEIDS':None}
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
311 pickle_str = marshal.dumps(db)
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
312 filename = self.indexdb + initchar
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
313 pickle_fh = open(filename, 'wb')
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
314 pickle_fh.write(zlib.compress(pickle_str))
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
315 os.chmod(filename, 0664)
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
316
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
317 # save done
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
318 self.changed = 0
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
319
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
320 def purge_entry(self, identifier):
2005
fc52d57c6c3e documentation cleanup
Richard Jones <richard@users.sourceforge.net>
parents: 1986
diff changeset
321 '''Remove a file from file index and word index
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
322 '''
891
974a4b94c5e3 Implemented the destroy() method needed by the session database...
Richard Jones <richard@users.sourceforge.net>
parents: 880
diff changeset
323 self.load_index()
974a4b94c5e3 Implemented the destroy() method needed by the session database...
Richard Jones <richard@users.sourceforge.net>
parents: 880
diff changeset
324
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
325 if not self.files.has_key(identifier):
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
326 return
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
327
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
328 file_index = self.files[identifier][0]
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
329 del self.files[identifier]
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
330 del self.fileids[file_index]
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
331
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
332 # The much harder part: clean up the word index
826
6d7a45c8464a Added reindex command to roundup-admin.
Richard Jones <richard@users.sourceforge.net>
parents: 825
diff changeset
333 for key, occurs in self.words.items():
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
334 if occurs.has_key(file_index):
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
335 del occurs[file_index]
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
336
833
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
337 # save needed
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
338 self.changed = 1
b80aaedba3db Only save the index if the thing is loaded and changed.
Richard Jones <richard@users.sourceforge.net>
parents: 827
diff changeset
339
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
340 def index_loaded(self):
818
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
341 return (hasattr(self,'fileids') and hasattr(self,'files') and
254b8d112eec cleaned up the indexer code:
Richard Jones <richard@users.sourceforge.net>
parents: 749
diff changeset
342 hasattr(self,'words'))
749
51c425129b35 Merged search_indexing-branch with HEAD
Roche Compaan <rochecompaan@users.sourceforge.net>
parents:
diff changeset
343
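The segment storage used by load_index() and save_index() above (one zlib-compressed marshal dump per initial character, with a missing file treated as an empty segment) can be sketched in isolation. This is a minimal illustration in Python 3 syntax, not the module's own code: the helper names `write_segment`, `read_segment` and `split_by_initial` are hypothetical, and `split_by_initial` uses `setdefault` where the listing preallocates a dict for each character in `letters`.

```python
import errno
import marshal
import os
import tempfile
import zlib


def write_segment(path, data):
    """Marshal `data`, compress it with zlib, and write it to `path`."""
    with open(path, 'wb') as fh:
        fh.write(zlib.compress(marshal.dumps(data)))


def read_segment(path):
    """Return the stored dict, or None if the segment file does not exist."""
    try:
        with open(path, 'rb') as fh:
            raw = fh.read()
    except IOError as error:
        # A missing segment just means no words start with that character;
        # any other I/O error is re-raised, as in the listing.
        if error.errno != errno.ENOENT:
            raise
        return None
    return marshal.loads(zlib.decompress(raw))


def split_by_initial(words):
    """Group a {word: entry} dict into {initial: {word: entry}} segments."""
    segdicts = {}
    for word, entry in words.items():
        # Assumption: segments are created on demand here, unlike the
        # listing, which only has dicts for a fixed set of characters.
        segdicts.setdefault(word[0].upper(), {})[word] = entry
    return segdicts


if __name__ == '__main__':
    base = os.path.join(tempfile.mkdtemp(), 'index.db')
    words = {'HELLO': {1: 2}, 'HELP': {2: 1}, 'WORLD': {2: 3}}
    for initial, segment in split_by_initial(words).items():
        write_segment(base + initial,
                      {'WORDS': segment, 'FILES': None, 'FILEIDS': None})
    print(read_segment(base + 'H'))
    print(read_segment(base + 'Z'))
```

Creating segments on demand sidesteps the lookup `segdicts[initchar]` failing for a word whose first character is not in the fixed list; whether that situation can arise in the module depends on how words are split before indexing, which this part of the listing does not show.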
1090
9b910e8d987d removed Log
Richard Jones <richard@users.sourceforge.net>
parents: 891
diff changeset
344 # vim: set filetype=python ts=4 sw=4 et si
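The ALL-words matching done by find() above can be illustrated without the on-disk index. The following is a rough sketch in Python 3 syntax that assumes the same in-memory shapes the listing uses (`words` maps an upper-cased word to a dict keyed by file index, `fileids` maps a file index to its identifier); the function name `find_all_words` and the sample data are made up for the example.

```python
def find_all_words(words, fileids, wordlist):
    """Return {file_index: identifier} for files containing every word.

    Words outside 2 < len(word) < 25 are skipped, mirroring the listing.
    """
    hits = None
    for word in wordlist:
        if not 2 < len(word) < 25:
            # word outside the bounds of what is indexed: ignore it
            continue
        entry = words.get(word.upper())
        if not entry:
            # one word has no matches at all, so the AND query fails
            return {}
        if hits is None:
            # first usable word: seed the candidate set
            hits = {}
            for k in entry:
                if k not in fileids:
                    raise ValueError('Index is corrupted: re-generate it')
                hits[k] = fileids[k]
        else:
            # later words: drop candidates that lack this word
            for fileid in list(hits):
                if fileid not in entry:
                    del hits[fileid]
    # no usable word in the query means no results
    return hits if hits is not None else {}


if __name__ == '__main__':
    words = {'HELLO': {1: 2, 2: 1}, 'WORLD': {2: 1}}
    fileids = {1: ('issue', '1', 'title'), 2: ('issue', '2', 'title')}
    print(find_all_words(words, fileids, ['hello', 'world']))
```

Iterating over `list(hits)` while deleting plays the role of `hits.keys()` in the Python 2 listing, where `keys()` returns a copy.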

Roundup Issue Tracker: http://roundup-tracker.org/